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Part  II  Abstract  contd. 


The  purpose  of  this  research  is  to  develop  anjl 
analyze  a  gradient-descent  surface-searching  al¬ 
gorithm  for  automatically  adjusting  (adapting)  the| 
parameters  of  a  linear  tapped-delay-llne  array 
processor  in  order  to  improve  its  performance  in 
an  unknown  changing  environment.  .The  tracking 
ability  of  this  algorithm  is  demonstrated  when  the} 
characteristics  of  the  nonstatlonarity  are  such 
that  the  optimum  parameter  sequence  can  be  modeled) 
as  a  first-order  Markov  process  with  a  known  tram 
itlon  function.  A  worst-case  analysis  of  the 
algorithm  is  presented  for  three  types  of  non- 
stationarities  when  the  above  model  for  the  non- 
stationarity  Is  not  applicable. 

The  techniques  developed  in  analyzing  the 
above  algorithm  provide  a  powerful  approach  for  tije 
further  study  of  gradient-descent  algorithms  used 
unknown,  nonstatianary  surfaces.  Among  the  most 
sequences  are: 

l)  the  removal  of  the  visual  assumption  that 
data  be  Jointly  Gaussian; 

li)  the  development  of  a  new  convergence 
for  a  dynamic  stochastic  approximation  algorithm, 
thereby  extending  a  branch  of  stochastic  approxim^t 
theory  to  the  analysis  of  adaptive  processors  in 
stationary  statistics; 

ill)  the  enlargement  of  the  class  of  problems 
which  stochastic  approximation  algorithms,  adaptive 
estimation  algorithms,  and  the  Kalman-Bucy  theory 
be  compared. 

Also  presented  In  an  appendix  is  a  procedure 
for  automatically  adjusting  the  convergence  fac¬ 
tor.  Some  experimental  results  sure  presented. 
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PART  I 


j^ie  problem  considered  in  this  report  is  to  find  the  vector  of  weights  W  minimizing 

E([d(t)  -  WTX(t)]2}  subject  to  linear  equality  constraints  on  W,  where  X(t)  is  a 
vector  of  random  variables  measured  at  time  t  and  d(t)  is  a  random  variable  related 
to  X(t).  This  is  a  classical.  problem  in  linear  estimation  theory,  except  that  the 
statistics  of  the  random  variables  are  assumed  unknown  and  must  be  learned  through  ob¬ 
servations.  A  computationally  simple  procedure,  called  the  Constrained  Least-Mean- 
Squares  algorithm,  is  proposed  for  processing  the  observations  and  is  shown  to  con¬ 
verge  to  the  optimal  linear  processor.  The  algorithm  is  useful  in  real-time 

modeling,  filtering,  and  estimation,  particularly  in  cases  where  the  optimal  time- 
varying  linear  processor  (e.g.,  Kalman  filter)  cannot  be  used  because  of  computational 
complexity  or  lack  of  necessary  information  about  the  system.  Special  attention  is 
given  to  real-time  processing  of  data  from  an  array  of  sensors,  and  it  is  shown  that 
»he  Constrained  Least-Mean -Squares  algorithm  permits  implementation  of  an  array  pro¬ 
cessor  the.t  requires  very  little  a  priori  statistical  information. 

PART  II 

21.  *he  classical  design  of  processors  for  sensor  arrays  whose  purpose  is  signal  de¬ 
flection  and  estimation,  a  receiver  is  optimized  on  the  basis  of  the  a  priori  knowledge 
of  the  statistics  of  its  input  signals.  However,  when  the  a  priori  knowledge  is  not 
available,  the  re  ceiver's  performance  can  still  be  improved  by  performing  measurements 
on  its  input  signals  and  incorporating  this  new  Information  into  its  design.  Such  re¬ 
ceivers  are  called  adaptive.  (contd.  on  back) _ _ 
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FOREWORD 


This  is  a  Final  Report  under  Contract  N00024-69-C-1430,  covering 
a  period  of  research  from  27  June  1969  to  30  April  1970.  The  report 
consists  of  two  parts: 

Part  I,  by  0.  L.  Frost,  III,  is  based  on  his  Ifc.D.  thesis.  The 
principal  contribution  is  a  new  adaptive  algorithm  for  minimizing  mean- 
square-error  in  cm  adaptive  processor  which  simultaneously  subjects  the 
weight-vector  components  to  a  linear  equality  constraint.  When  applied 
to  adaptive  .".'rays,  this  algorithm  allows  one  to  obtain  precise  control 
of  the  array  frequency  response  and  gain  level  in  the  "look  direction" 
while  minimizing  mean  -  8  qviare- error.  Frost's  algorithm  is  probably  the 
best  yet  devised  for  adaptive  arrays. 

Part  II,  by  James  Edward  Brown,  III,  is  based  on  his  Fb.D.  thesis. 
This  work  is  highly  theoretical  and  presents  a  framework  for  mathemati¬ 
cal  analysis  of  adaptive  processors  when  subjected  to  changing  (non¬ 
stationary)  signal  and  noise  fields.  For  various  adaptive  algorithms, 
rate  of  convergence  and  variance  of  the  weight  vectors  are  analyzed. 
This  work  is  general,  and  applicable  to  a  wide  variety  of  adaptive 
signal  processors. 
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ADAPTIVE  LEAST  SQUARES  OPTIMIZATION 
SUBJECT  TO  LINEAR  EQUALITY  CONSTRAINTS 

by 

Otis  Lamont  Frost,  III 
ABSTRACT 

The  problem  considered  in  this  report  is  to  find  the  vector  of 
weights  W  minimizing  E([d(t)  -  W^X(t)l^)  subject  to  linear  equality 
constraints  on  W,  where  X(t)  is  a  vector  of  random  variables 
measured  at  time  t  and  d(t)  is  a  random  variable  related  to  X(t). 

This  is  a  classical  problem  in  linear  estimation  theory,  except  that 
the  statistics  of  the  random  variables  are  assumed  unknown  and  must  be 
learned  through  observations.  A  computationally  simple  procedure, 
called  the  Constrained  Least-Mean-Squares  algorithm,  is  proposed  for 
processing  the  observations  and  is  shown  to  converge  to  the  optimal 
linear  processor. 

The  algorithm  is  useful  in  real-time  modeling,  filtering,  and 
estimation,  particularly  in  cases  where  the  optimal  time-varying  linear 
processor  (e.g.,  Kalman  filter)  cannot  be  used  because  of  computational 
complexity  or  lack  of  necessary  information  about  the  system.  Special 
attention  is  given  to  real-time  processing  of  data  from  an  array  of  sensors, 
and  it  is  shown  that  the  Constrained  Least -Mean-Square b  algorithm  permits 
implementation  of  an  array  processor  that  requires  very  little  a  priori 
statistical  information. 
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I.  INTRODUCTION 


This  paper  presents  a  simple  algorithm  for  minimizing 
a  quadratic  cost  criterion  subject  to  linear  equality 
constraints.  The  technique,  called  the  "Constrained- Least- 
Mean  Squares"  or  "Constrained  IMS"  algorithm  is  an  iterative, 
stochastic  gradient- descent  algorithm  with  low  memory 
requirements.  Computationally,  it  is  simple  enough  that  for 
a  variety  of  practical  problems  it  can  be  implemented  in 
real  time  on  a  small  general-purpose  computer. 

The  algorithm  is  applicable  to  problems  in  least  squares 
filtering,  estimation,  modeling,  and  others  which  may 
properly  be  viewed  as  linear- constrained  quadratic  optimi¬ 
zation  problems.  Specific  examples  treated  in  the  paper 
include  real-time  minimum- variance  unbiased  estimation, 
consistent  modeling  that  includes  known  linear  constraints 
on  the  model  parameters,  and  real-time  processing  of  data 
from  an  array  of  antennas  or  other  sensors.  The  constrained 
least- meanr-  squares  approach  is  particularly  interesting  in 
the  estimation  and  array  processing  applications  because 
it  requires  very  little  a  priori  information  for  implementation. 

The  rate  of  convergence  of  the  algorithm  is  studied  and 
its  steady- state  performance  is  compared  with  the  optimum. 

A  gain  constant  is  shown  to  control  a  tradeoff  between  fastest 
convergence  rate  and  best  steady- state  performance.  By 
suitable  choice  of  gain  the  steady- state  performance  of  the 
algorithm  can  be  made  arbitrarily  close  to  the  performance 
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of  the  optimum  least- squares  filter. 

Previous  work  on  unconstrained  least- squares  array 
processing  was  done  by  Griffiths  [12]?  his  method  requires 

%■ 

knowledge  of  second-order  signal  statistics.  Widrow, 

et  al.  [30]  proposed  a  variable- criterion  optimization 

procedure  involving  the  use  of  a  known  training  signal; 

this  was  a  direct  application  of  the  original  work  on  adaptive 

filters  done  by  Widrow  and  Hoff  [29],  Griffiths  also 

proposed  a  constrained  least-mean- squares  processor  not 

requiring  a  priori  knowledge  of  the  signal  statistics  [11]? 

a  new  derivation  of  this  processor,  given  in  Appendix  A,  * 

shows  that  it  may  be  considered  as  putting  "soft"  constraints 

on  the  processor  via  the  quadratic  penalty  function  method.  t 

"Hard"  (i.e.,  exactly)- constrained  iterative  optimi¬ 
zation  was  studied  by  Rosen  [23]  for  the  deterministic  case. 

Lacoss  [14]  and  Booker  [1]  studied  "hard "-cons trained 
stochastic  optimization  in  the  array  processing  context. 

All  three  authors  u3ed  "gradient  projection"  techniques; 

Rosen  and  Booker  correctly  indicate  that  gradient  projection 
methods  are  susceptible  to  cumulative  roundoff  errors  and 
are  not  suitable  for  long  runs  without  an  additional  error- 
correction  procedure.  The  Constrained  IMS  algorithm  is 

i 

designed  to  avoid  error  accumulation  while  maintaing  a 
"hard"  constraint?  as  a  result,  it  is  able  to  operate  con¬ 
tinually  in  order  to  track  an  environment  that  may  be  slowly 
time- varying.  Discussion  of  gradient- pro jection  methods  and 
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a  comparison  of  the  error- correcting  properties  of  the  two 
algorithms  is  given  in  Section  VII. 

*  In  the  following  section,  the  general  constrained 

least- mean- squares  problem  is  formulated  as  a  theorem  and 
the  optimal  solution  is  derived  under  the  assumption  that 
all  the  relevant  statistics  of  the  problem  are  known. 

Several  corollaries  applying  to  interesting  special  cases 
are  drawn.  The  optimal  solution  is  seen  to  be  computationally 
difficult,  requiring  a  number  of  matrix  multiplications  and 
>  inversions.  In  Section  III,  the  computationally  simple 

Constrained  IMS  algorithm  is  derived  that  converges  to  the 
optimal  solution  while  learning  the  statistics  of  the  problem. 
This  algorithm  and  studies  of  its  properties  is  the  principal 
result  of  this  thesis.  Special  forms  of  the  general  algorithm 
are  used  to  solve  particular  problems.  Remaining  sections 
are  concerned  with  geometrical  interpretation  of  the  algorithm, 
its  performance,  applications,  and  computer  simulations. 


II.  CONSTRAINED  LEAST- MEAN- SQUARES  OPTIMIZATION 


A.  Notation 

In  this  paper  a  vector  is  taken  to  be  a  column  vector. 

The  superscript  T  denotes  transpose.  The  expected  value 
of  a  quantity  (Q)  is  denoted  by  E  fQ)  or  0  .  The  matrix  of 
correlations  between  two  vectors  of  random  variables, 

A  and  B  ,  is  written  E(AB  }  *  R^  ;  the  vector  of  corre¬ 
lations  between  a  vector  X  and  a  scalar  d  is  written 
Efxd)  *  R^  .  A  vector  of  zeros  of  arbitrary  dimension  is 
9  and  the  matrix  of  zeros  is  .0 

B.  The  General  Problem  and  Optimal  Solution 

There  are  two  purposes  for  this  section.  The  first  is  to 
define  the  general  constrained  least-mean- squares  problem  and 
derive  the  optimal  solution.  This  solution  could  be  obtained 
directly  if  one  knew  the  problem  statistics  beforehand.  It 
will  be  shown  later  that  the  Constrained  LMS  algorithm  converges 
to  this  solution  and  can  be  used  when  the  problem  statistics 
are  unknown.  The  second  purpose  is  to  show  that  several  inte¬ 
resting  and  important  problems  can  be  put  in  the  framework  of 
the  general  constrained  IMS  problem  and  therefore  are  solvable 
by  the  algorithm. 

Let  X  be  a  vector  of  n  observed  data  points, 

X  =  (xi* x2#  •  •  • » xn)  ,  that  are  drawn  from  a  distribution  with 
Efxx  }  =  •  Let  d  be  a  random  variable  correlated  with  X 

by  an  n- dimensional  correlation  vector  R^  .  In  this  section 
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and  are  assumed  known.  Let  W  be  an  n- dimensional 

vector  of  weightings  that  will  be  applied  to  X  to  estimate 
d  .  Let  the  estimate  of  d  be 

y  =  wTx  ,  (2.1) 

and  the  error  between  d  and  the  estimate  be 

e  =  d-y  .  (2.2) 

The  constrained  least-mean- squares  optimization  problem  is 
to  find  the  weight  vector  W*  that  minimizes  the  expected 
squared  error  in  the  estimate, 

E(e2)  *  EfTd-y]2)  -  E[d-wTx]2)  (2.3) 

subject  to  certain  linear  equality  constraints  on  w  . 

The  reason  for  placing  constraints  on  W  was  suggested 
in  the  introduction  and  will  be  made  clear  in  the  applications. 
In  general,  m  linear  equality  constraints  (with  n  >m)  are 
placed  on  W  of  the  form 

T 

c±W  ■  f^  ,  i*l,2,...,m  ,  (2.4) 

where  each  c^  is  an  r>- dimensional  vector  and  each  f^  is  a 

scalar  constant.  This  is  a  set  of  m  simultaneous  equations 

which  the  n  components  of  W  must  satisfy,  but  since  m  <  n, 

the  equations  do  not  completely  determine  or  totally  constrain 

W  .  Therefore  W  can  be  optimized,  to  minimize  a  mean  square 

error,  subect  to  the  linear  constraint  (2.4).  It  is  well 

T 

known  that  by  requiring  W  to  satisfy  c^W  =  f^  for  any 
single  i  restricts  W  to  lie  in  an  (n-1)- dimensional  hypdr- 
plane.  Similarly,  it  is  shown  in  Section  IV  that  constraining 
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w  to  satisfy  the  m  equations  of  (2.4)  restricts  W  to 
an  (n-m)- dimensional  plane*  if  the  vectors  c^  are  linearly 
independent.  To  express  the  constraints  in  matrix  notation 
define 


c  _  ...  c  n  ,  if 

i  m  • 


A  f2 


(2.5) 


The  constraint  matrix  C  is  (nxm)  with  n>m  .  It  will 
be  assumed  that  the  constraint  vectors  ci  are  linearly 
independent  so  that  by  the  definition  of  rank  as  the  number 
of  linearly  independent  columns  of  a  matrix,  C  has  full 
rank  equal  to  m  .  The  constraints  (2.4)  are  now  written 


C*W  -  If  . 


(2.6) 


The  problem  is  summarized  and  the  solution  is  given  in  the 
form  of  a  theorem. 


Theorem  1.  (Constrained  Linear  Leastr-Mean- Squares  Optimi¬ 
zation)  Let  d  be  a  random  variable  and  X  be  an 
n- dimensional  vector  of  random  variables  with  known 
correlation  matrices 


*Other  names  for  an  " r- dimensional  plane"  are  "linear  variety" 


and  "Linear  manifold". 
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E(XXT)  =  (nxn) 

E(xd)  =  R^  (nxl) 

and  positive- definite.  The  optimum  constrained 

least-mean- squares  weight  vector  solving 

minimize  E([d-WTX]2) 

T  (2.7) 

subject  to  C  W  *  5 

where  C  is  an  (nxm)  matrix  (n  >m)  of  full  rank  and 
J  is  an  m- vector,  is 

w*  *  [  1  -  ‘Sex0  (cTbxxc  >'  lcTlRxxExa  +  "xx0  (cTrxxc  >'  **  • « • 8  > 

The  optimum  constrained  linear  least-mean- squares  estimate 
of  d  is  y  =  W*X  . 

Proof  of  Theorem  1. 

The  proof  uses  the  method  of  Lagrange  multipliers, 
which  is  basic  to  the  later  development  of  the  major  algorithm 
and  another  proof .  A  geometrical  interpretation  of  Lagrange 
multipliers  expressed  in  the  context  of  this  work  is  pre¬ 
sented  in  Appendix  E. 

The  cost  function  is  J(W)  =  E[[d-WTX]2} 

=  E(d2)  -  2  E {WTXd )  +  E(WTXXTW) 

=  E(d2)  -  2WTRxd  +  WTRxxW  .  (2.9) 
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Including  a  factor  of  ^  to  simplify  later  arithmetic,  adjoin 
the  constraint  function  to  the  cost  function  by  a 
m- dimensional  vector  of  undetermined  Lagrange  multipliers  X  : 

H(W)  =  JsJ(W)  +  *T(CTW-n 

=  ^[Ed2  -  2WTRX(J  +  WTRxxW]  +  XT  (CTW-  3F )  .  (2.10) 

The  necessary  conditions  for  optimality  are 

V^O*)  *  e  .  (2.11) 

and 

CTW  -  *  .  (2.12) 


Taking  the  gradient  of  (2.10)  with  respect  to  W 

V^fW)  *  +  R^W^+CX  *  9  (2.13) 

and  solving  for  the  optimal  weight  vector 

«.  -  'ScXd  -  ^cX  •  (2-i4) 

exists  because  was  assumed  positive 

definite.  Since  W*  must  satisfy  the  constraint  (2.12) 

cTw.  -  *  -  cXkta  -  (2-15) 

and  from  (2.15)  X  is  found  to  be 

A  *  [cTRxxC,_1(cTlScXRXd  “  *  1  *  {2‘16) 

T  -  1  -  1 

It  is  shown  in  Appendix  C  that  the  existence  of  [C  R^C] 
follows  from  the  facts  that  is  positive  definite  and 


where 


% 


% 
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C  has  full  rank.  Substituting  the  last  expression  for  the 
Lagrange  multipliers  into  the  expression  for  W*  (2.14) 
the  result  follows. 

This  completes  the  proof  of  Theorem  1. 

C.  Special  Cases 

A  well-known  special  case  of  Theorem  1  is  the  uncon¬ 
strained  least- squares  problem. 

Corollary  1.1.  (Least- Mean- Square  Error — Wiener)  The 
optimum  set  of  weights  W*  solving  the  problem 
defined  by  Theorem  1  without  constraints,  i.e., 

minimize  E{[d-WTX]2}  (2.17) 

with 

E(XXT)  -  R^ 

E(xa)  =  R^ 

is 

W*x  =  RXXRXd  *  (2.18) 

And  the  best  unconstrained  estimate  of  d  is 
T 

Y  =  wJlX  • 
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Proof  of  Corollary  1.1. 

Let  the  constraint  matrix  C  vanish  in  Theorem  1. 
See  especially  Eq.  (2.14)  of  the  proof. 

This  completes  the  proof  of  Corollary  1.1. 


A  second  well-known  problem  that  can  be  formulated  as 
a  special  case  of  Theorem  1  is  the  distortionless  least-mean- 
squares  estimation  problem  that  was  solved  by  Gauss. 

Corollary  1.2.  (Least- Mean- Squares  Distortionless 

Estimate —  Gauss,  Markov)  Let  the  data  vector  X 
be  of  the  form 

X  -  CB  +  N  ,  (2.19) 

where  C  is  a  known  (nxm)  matrix  of  rank  m  ,  and  B  is  an 

T 

unknown  m-dimensional  vector  with  B  *  b,b-...b  . 

I  I  f  JBl 

B  may  be  a  vector  of  random  variables  with  unknown 
mean  (so  E(b)  =B)  ,  or  it  may  be  a  vector  of  unknown 
parameters,  in  which  case  E{b)  *  B  *  B  .  N  is  an 
unknown  n- dimensional  vector  of  random  variables 
considered  as  noise.  B  and  N  are  uncorrelated,  with 


e(bbM  .  Rjg 

(m  x  m) 

E{n)  =  Q 

(n  x  1) 

e(NNT)  -  Rjjjj 

(n  x  n) 

e(bnt)  =  0 

(m  x  n) 

and  is  positive  definite. 
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Fig.  2.1.  The  estimation  problem  of  Corollary  1.2.  Thick 
lines  indicate  vector- valued  quantities,  w  is 
chosen  so  that  y  is  an  estimate  of  the  itl1 
component  of  B  ,  b.  . 
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Fig.  2.2.  Manipulation  of  the  flow  charts  from  Fig.  2.1 
yields  (A).  Constraining  WTC  =  jT  yields  (B), 
showing  that  the  constraint  puts  a  unity  transfer 
function  on  b^  and  that  y  =  b^  +  WTN  . 
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Thus 

*XX  =  CRBBC  +  ^  ‘ 

The  problem  is  to  make  a  linear  least- squares  estimate 

T 

of  b^  ,  say  y*W  X,  that  ia  unbiased  (see  Fig.  2.1). 

We  wish  the  estimate  of  b^  to  be  corrupted  only  by 
the  minimum  amount  of  zero- mean  noise.  The  optimum 
weight  vector  solving  the  problem 

minimize  E{[b^  -  WTX]2) 

subject  to  E(WTX-bi)*0  (2.20) 


is 

V  - 

4* 


(2.21) 


where 


s  * 


position 


(2.22) 


T 

and  the  best  unbiased  estimate  of  b.  is  y=W*  X  . 

1  2 


Proof  of  Corollary  1.2. 

The  problem  (2.20)  is  put  into  the  form  of  the  problem 
solved  by  Theorem  1.  Observe  that  =  STB  .  Using  (2.19) 
X  =  CB  +  N  ,  and  the  fact  that  N  is  zero  mean,  we  have 

E  (WTX  -  b±  )  =  E  {WTCB  +  WTN  -  )  =  E  (WTCB  -  )  . 

Now  if  we  require  CTW=?  then 
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E{WTX-bi)  =  EfA-b^  =0  (2.23) 

and  the  constraint  of  (2.20)  is  satisfied.  This  is  reasonable 
since  the  constraint  applies  a  unit  transfer  function  to  bi 
(see  Fig.  2.2). 

T 

Further,  with  the  constraint  C  W  =  y  in  force 
y  s  wTX  *  WTCB  +  WtN  =  bi  +  and  the  cost  function  becomes 

E ( [bi -  y]  2 }  =  E (b2 }  -  2E(biy)  +  E{y2) 

=  E{b?)  -  2E(bi(bi+WTN))  +  E(y2) 

=  E Cy2 }  -  E{b2)  .  (2.24) 

Because  E{b?)  is  a  constant,  the  weight  vector  that  mini- 

2  2  2 
mizes  E{y  )  also  minimizes  E{y*}-E{bf)  ,  so  the  problem 

(2.20)  reduces  to 

minimize  E(y2}  =  E{[WTX]2) 

subject  to  CTW  =  ?  ,  (2.25) 

where  C  is  defined  in  (2.19),  and  5  in  (2.22).  This  is 
a  special  case  of  the  problem  of  Theorem  1  with  d  =  0  . 

Since  d  =  0  ,  E(xd)  =1*^  =  0  and  so  the  first  term  of  (2.8) 
vanishes  and  the  optimal  solution  becomes  the  second  term. 

This  completes  the  proof  of  Corollary  1.2. 

In  some  cases  the  unbiased  estimator  may  not  be  the 
most  desirable.  Suppose  that  (as  in  the  array- processing 
problem  discussed  later)  B  is  the  sum  of  two  vectors 

B  =  A  +  A  ,  (2.26) 
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% 


V 


Fig.  2.3.  B  is  a  given  structure  for  Corollary  1.3. 

B  is  the  sum  of  signals  &  plus  noise  A  . 
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where  <4  and  A  are  m-dimensional  vectors  of  random  variables 

which  may  be  statistically  correlated.  f>  is  to  be  thought 

of  as  a  vector  of  "signals",  one  of  which  we  wish  to  estimate, 

say  s^  .  A  is  to  be  considered  as  a  vector  of  additive 

noise.  Note  that  A  and  N  are  both  "noise"  vectors  of 

different  dimension  (see  Fig.  2.3). 

Since  the  "unbiased  estimator"  of  Corollary  1.2  forms 

an  estimate  of  b^  ,  which  is  equal  to  s^  plus  a.  ,  it 

may  not  be  satisfactory  as  an  estimator  of  s^  alone. 

Another  approach  is  to  recall  from  Corollary  1.2  that 

by  suitable  choice  of  the  constraints  a  vector  5  can  bo 

applied  directly  to  B  .  Therefore  a  "filter"  vector  5 

(which  may  be  different  from  (2.22))  may  be  designed  to  use 

the  correlation  among  the  components  of  &  and  the  (hopefully 

different)  correlations  among  the  components  of  A  ,  so  that 

when  3  is  applied  to  B  it  may  enhance  s^  in  the  output 

and  discriminate  against  A  .  This  is  exactly  analogous  to 

the  use  of  a  filter  in  the  frequency  domain  to  pass  signals 

and  discriminate  against  noises. 

In  the  following,  it  is  assumed  that  J  is  a  vector 

chosen  by  the  user.  The  best  choice  of  5  is  a  topic  with 

which  we  do  not  wish  to  get  deeply  involved.  An  example  of 

i  a  choice  of  5  is  given  in  Example  3,  Section  VI.  If  the 

T 

weight  vector  W  is  constrained  to  satisfy  C  W  =  5  ,  then 
the  output  is 

y  =  WTX  =  WTCB  +  WTN  =  ?TB  +  WTN  ,  (2.27) 


and  output  power  is 
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E(y2}  =  STRbB5  +  wTrnNW  •  (2.28) 

i 

Because  B  and  N  are  uncorrelated  there  is  no  cross  term 
and  so  long  as  W  satisfies  the  constraint  any  permissible 

•  •  •  w 

variation  m  W  affects  only  the  power  of  the  noise  in  the 

output.  Thus  the  "degrees  of  freedom"  of  W  not  constrained 
T 

by  C  W  *  5  may  be  used  to  minimize  the  excess  noise  power  in 
the  estimate  of  s^  . 

With  the  preceding  motivation,  the  problem  is  set  up 
as  a  special  case  of  Theorem  1. 

Corollary  1.3.  (Least- Mean- Squares  Filtered  Estimate) 

Let  X  be  a  known  n- dimensional  vector  of  observations 
of  the  form 

« 

X  =  CB  +  N  , 

where  C  is  a  known  (nxm)  matrix  with  (n  >m) . 

B  and  N  are  unknown  vectors  of  random  variables 
with  dimensions  m  and  n  respectively.  Let  B  be 
o  f  the  form 

B  *  A  +  A  , 

where  A  and  A  are  m- dimensional  vectors  of  random 
variables.  We  wish  to  form  an  estimate  of  the  ith 
element  of  *  ,  . 
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<4 


» 


j 


e{n}  =  e 

E(m»T)  = 

e[bnt)  «  0 

EtxxT)  -  Rjjx  . 


Let  f  be  a  given  m- dimensional  filter  vector.  The 
least-mean- squares  filtered  estimator  is  the  weight 
vector  solving  the  problem 


minimize  Efy2}  =  E{[WTX]2} 
T 

subject  to  C  W  =  5F 


(2.29) 


and  is 

w*  "  RxxcfcTRxxCJ~15  (2.30) 

3 

The  best  filtered  estimate  of  s.  is  y=W*  X  . 

1  3 

Proof  of  Corollary  1.3. 

Proof  follows  directly  from  Theorem  1,  with  d=0  and  hence 
Rxd  *  9  .  The  fact  that  y  ■  W*  X  is  an  estimate  for  si 
follows  from  the  above  discussion. 

This  completes  the  Proof  of  Corollary  1.3. 


Remark :  If  ?  is  chosen  as  in  (2.22)  (one  unit  entry  and 

the  rest  zeros)  the  solution  is  the  same  as  the  solution 
of  Corollary  1.2. 


III.  THE  ADAPTIVE  ALGORITHM 


I 

A.  The  Unknown  Statistics  Problem 

Suppose  now  that  the  correlation  matrices  and 

W 

Rxd  required  by  Theorem  1  are  not  known  a  priori.  Instead 
a  sequence  of  observation  vectors  (X  (0)  ,X  (1) , . . .  ,X  (k) , . . . } 
is  presented,  each  vector  drawn  independently  from  a  quasi¬ 
stationary  ergodic  distribution  with  autocorrelation  R^  . 

A  sequence  of  random  variables  (d  (0)  ,d  (1) ,  . . . , d  (k) , . . . ) 
which  are  related  to  the  X's  by  an  unknown  correlation 
vector  is  also  presented.  We  wish  to  minimize  the 

constrained  mean  square  error  of  the  problem  of  Theorem  1.  , 

An  obvious  solution  is  to  make  estimates  of  the  unknown 
correlation  matrices  from  observations,  e.g.,  # 

R^OO  *  a^xx (k ~  1)  +  (1-  a)X(k-  l)XT(k-  1)  , 

and 

R^OO  -«**,<*- 1}  +  (l-a)X(k-  l)d(k- 1)  , 

0  <  a  <  1  , 

and  insert  these  estimates  into  the  expression  for  the 
optimal  weight  vector  given  by  Theorem  1,  Eq.  (2.8). 

Inspection  of  (2.8)  shows  that  because  of  the  number  of 

matrix  multiplications  and  inversions  involved,  a  great  deal  k 

of  computation  is  required  at  each  iteration  by  this  approach, 
ultimately  limiting  the  rate  at  which  estimates  can  be  made 

* 

and  the  dimensionality  of  a  system  of  given  cost.  See 
Appendix  F  for  an  example  of  the  performance  of  this  approach. 
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The  next  section  describes  a  computationally  simple 
procedure  (the  Constrained- IMS  algorithm)  that  converges  to 
the  weight  vector  Ww  that  solves  the  problem  posed  by 
Theorem  1  without  prior  knowledge  of  the  correlation  matrix 
R^  .  Further,  if  d  (k)  is  available  for  training,  or  is 
not  required  fcr  the  solution  (as  in  Corollaries  1.2  and  1.3) 
then  the  algorithm  does  not  require  knowledge  of  R^d  . 

B.  Derivation 

The  Constrained- IMS  algorithm  is  based  on  a  constrained 
gradient  descent,  satisfying  C^W  =5  at  all  times  while 
iterating  to  find  a  weight  vector  minimizing  the  cost  function 

J  (W)  -  -§E{[d  (k)  -  WTX(k)]2}  =  -§[Ed2-  2WTRxd  +  WTRxxW]  . 

For  motivation  of  the  derivation,  temporarily  suppose 
that  and 

form  the  function  H(W)  by  adjoining  the  constraint  to  the 
cost  function  by  a  m-dimensional  vector  of  Lagrange  multi¬ 
pliers  X  : 

H(W)  =  -|[Ed2-  2WTRxd  +  WTRXXW]  +  XT[CTW-y  ]  .  (3.1) 

As  in  Theorem  1,  we  wish  to  find  a  weight  vector  W*  such 

that  the  gradient  of  H  at  Ww  is  9  and  W*  satisfies 

T  .... 

C  W  =  5  .  The  gradient  descent  is  initialized  by  choosing 

a  weight  vector  W(0)  that  satisfies  the  constraint. 


Rxd  are  known.  As  in  the  proof  of  Theorem  1, 
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The  gradient  of  H  with  respect  to  W  is 

V  *  Rxxw-Rx<J  +  cX  •  (3’2) 

At  each  iteration  the  weight  vector  is  moved  in  the  direction  ^ 

of  the  negative  gradient.  (Note:  a  move  in  the  direction 
of  the  positive  gradient  tends  to  increase  a  cost  function.) 

The  length  of  the  step  is  proportional  to  the  magnitude  of 
the  gradient  and  scaled  by  a  gain  factor  p.  .  At  the  kth 
iteration  the  next  weight  vector  would  be 

W (k+1)  »  W(k)  -  PV^OO 

=  WOO-nfR^WOO-  Rxd  +  cX(k)]  .  (3.3) 

The  constrained  gradient  [R^W  (k)  -  +  CA  (k)]  is  the  • 

unconstrained  gradient 

V  -  RXXW(k)-RXd  '  (3-4) 

plus  the  term  cA(k)  .  As  noted  in  Appendix  E  and  later  in 
Section  IV,  the  vector  cMk)  is  orthogonal  to  the  constraint. 

By  proper  choice  of  ^  (k)  the  component  of  the  unconstrained 
gradient  normal  to  the  constraint  (and  hence  deviating  from 
it)  can  be  exactly  cancelled.  Thus  the  Lagrange  multipliers 
are  chosen  by  requiring  W(k+1)  to  satisfy  the  constraint 
*  =  CTW  : 

‘f  =  cTw  (k+i)  =  cTw(k)  -  uc^r^woO  +  ^cTRxd-  PCTC*  (k)  , 

(3.5) 
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and  solving  for  the  Lagrange  multipliers  for  the  kth 
iteration, 

*00  =  (CTC f  1CTRxxW  Ck )  -  (CTC f  1CTW  (k )  -  (CTC)'1CTRxd  , 

(3.6) 

where  it  is  shown  in  Appendix  C  that  the  existence  of 
T  - 1 

(C  C)  follows  from  the  fact  that  C  has  full  rank. 
Inserting  the  Lagrange  multipliers  of  (3.6)  into  the  iter¬ 
ative  equation  (3.3)  we  have 

W(k+1)  =  W(k)  -  Hd-C^CrV]  fRXXW(k)-RXdJ  +C(CTC)_1[5F-CTW(k)] 

(3.7) 

The  algorithm  may  be  rewritten,  defining  the  n-dimensional 
vector 

F  =  CfC^C)"1?  (3.8) 

W (k+1)  =  [I- C^C)”1^]  [W(k)  -  UR^Wfk)  +^Rxdl  +F.  (3.9) 

Equation  (3.9)  is  a  deterministic  gradient- descent 

algorithm  that  converges  to  the  optimal  weight  vector  W* 

of  Theorem  1  for  a  suitably  small  choice  of  the  gain  M- 

(proof  given  in  Section  VI).  However,  it  requires  knowledge 

of  the  correlation  matrices  R^  and 

study  are  assumed  unavailable  a  priori.  But  recall 

=E{k(k)XT  (k) )  and  R^  =  E  (X  (k)d  (k) )  ,  so  an  easily- 

til 

available  and  simple  approximation  for  R  at  the  k 

AA 

iteration  is  the  outer  product  of  the  observation  vector 
with  itself:  X(k)XT(k)  ;  likewise  if  d(k)  is  available 


R^  ,  which  in  this 


a  simple  approximation  for  at  the  Xth  iteration  is 

X(k)d(k)  .*  This  substitution  gives  the  stochastic  algorithm 

W(k+1)  *  [I-C(CTC)~  1CT]  [W(k)-  MX  (k)XT(k)W  (k)  +  HX(k)d(k)] +  f  , 

(3.10) 

which  can  be  simplified  using  y(k)  «X (k)W(k)  and 
e(k)*d(k)-y(k)  to 

W(k+1)  -  [I-C(CTC)"  V^WOO  +  M-e  (lc)X  (1c)  ]  +F  .  (3.11) 

Equation  (3.11)  is  the  Cons  trained- IMS  algorithm.  It 
is  a  stochastic  gradient-descent  algorithm  satisfying  the 
constraint  that  CTW(k)«<f  at  all  times  (check:  CTW(k+l)  ~  5F ) . 
At  each  iteration  it  requires  only  the  observations  X(k) 

(and  d  (k)  if  required).  No  a  priori  knowledge  of  R^ 
or  is  needed.  The  most  comples  operation  is  the  multi¬ 

plication  of  a  constant  matrix  times  a  vector,  which  is  a 
substantial  savings  over  the  matrix  multiplications  and 
inversions  required  (either  explicitly  or  implicitly)  by  a 
direct  implementation  of  the  optimal  equations. 

The  algorithm  was  derived  heuristically.  Its  convergence 
to  the  optimum,  rate  of  convergence,  and  steady- state 

fAs  mentioned  previously,  better,  but  more  complex,  esti¬ 
mates  for  R^  are  available,  such  as  j^ZX(i)XT(.i)  . 

See  Saradis,  .et  al.  [24]  for  use  of  this  estimate  in  another 
algorithm;  and  Mantey  and  Griffiths  [18]  for  a  closely 
related  estimate.  For  discussion  and  use  of  simpler 
estimates,  see  Moschner  [20] ,  Lender  [15] ,  and  Nuttall  [21] . 

m 

The  use  of  X(k)X  (k)  here  is  a  compromise  between  algorithm 
complexity  and  performance  and  may  be  changed  if  desired. 
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performance  are  shown  in  Section  V.  The  next  section 
develops  the  theory  of  constrained  gradient  descent  from  a 
geometrical  viewpoint. 


IV.  A  GEOMETRICAL  VIEW  OF  THE  ALGORITHM 


A  geometrical  interpretation  of  the  Constrained- IMS 
algorithm  (3.11)  is  now  given.  Results  will  be  found  that 
permit  an  easier  and  more  intuitive  derivation  of  the 
properties  of  the  algorithm  than  would  otherwise  be  possible. 
Readers  interested  in  applications  may  skip  to  Section  VI. 

We  start  from  basic  definitions. 

Definition  (Subspace)  Let  a  and  3  be  real  scalar  numbers. 
A  nonempty  subset  S  of  a  vector  space  is  called  a 
subspace  if  every  vector  of  the  form  aV  +  f3W  is  in  S 
whenever  V  and  W  are  both  in  S  . 

Since  a  subspace  must  contain  at  least  one  element  w  , 

it  must  also  include  the  zero  vector  0  because  0  •  W  =  0  . 

Thus  every  subspace  includes  the  origin. 

Let  2  be  the  set  of  all  n-dimensional  weight  vectors 

satisfying  the  homogeneous  form  of  the  constraint  equation 
T 

C  W  =  0  . 

2  -  (W  :  CTW  =  0)  .  (4.1) 

Then  we  have 

Geometrical  Property  1.  The  set  2  =  (W  :  CTW  =  0 )  defined 
by  the  homogeneous  form  of  the  constraint  equation 
is  a  subspace. 


-24- 
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Proof  of  Geometrical  Property  1. 

Let  V  and  Z  be  vectors  in  2  .  They  must  satisfy 
T  T 

the  equations  C  V  =  9  and  C  Z  =  0  .  Therefore  for  any 

constants  a  and  (3  ,  the  vector  Y  =  av  +  0Z  also 
T 

satisfies  C'£-Q  ,  so  the  set  2  is  a  subspace. 

This  completes  the  proof  of  Geometrical  Property  1. 

Definition  (Linear  Variety)  A  linear  variety  is  a  trans¬ 
lation  of  a  subspace. 

A  linear  variety  L  may  be  expressed  by  the  set  equation 
L  =  S  +  U  ,  where  S  is  a  subspace  and  U  is  any  vector  in 
the  linear  variety.  The  linear  variety  L  is  said  to  be 
parallel  to  the  subspace  S  . 


Fig.  4.1.  A  linear  variety  and  its  subspace. 

Let  T  be  the  set  of  all  weight  vectors  W  satisfying 
T 

the  constraint  C  W  =  ?  . 


r  =  {w  :  cTw  =  y }  . 


(4.2) 
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This  definition  leads  to 

Geometrical  Property  2.  The  set  r  *  (W  :  CTW  *J)  defined 

by  the  constraint  equation  is  a  linear  variety  parallel 
to  the  subspace  2  . 

Proof  of  Geometrical  Property  2. 

We  must  show  that  a  vector  W  is  in  r  if  and  only 
if  it  can  be  written  as  the  sum  o^  a  vector  in  2  and  a 
translation  vector. 

(IF)  Let  the  translation  vector  U  be  in  r  and  Z. 

T  T 

be  any  vector  in  2  .  Then  C  U«J  and  C  Z  *  0  .  Thus 

W  =  Z  +  U  satisfies  CTW  =  CT  (Z  +  U)  ■  ?  ,  so  if  U  is  in 

T  the  sum  of  any  vector  in  2  and  U  is  in  r  . 

(ONLY  IF)  Now  suppose  a  vector  W  in  r  satisfied 
T 

C  W  =  5  but  could  not  be  written  as  the  sum  of  U  and  a 

vector  in  2  .  Then  it  follows  that  the  vector  W-  U  could 

T 

not  be  written  as  a  vector  in  2  .  But  C  (W-  U)  =  =  0 

so  w-  u  is  in  2  .  Contradiction. 

This  completes  the  proof  of  Geometrical  Property  2. 

Geometrical  Property  3.  The  shortest  vector  from  the 

origin  to  the  linear  variety  r  is  the  vector 
T  —  1 

F  =  C  (C  C)  y  ,  which  is  orthogonal  to  r  . 

Proof  of  Geometrical  Property  3. 

2  T 

We  want  to  find  the  vector  W  minimizing  ||w||  =W  W 
T 

while  satisfying  C  W=?  .  Use  the  method  of  Lagrange 


multipliers.  Form  the  function  H(W)  by  adjoining  the 
constraint  to  the  cost  criterion: 
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This  is  the  vector  F  appearing  in  the  algorithm  (3.11) 
and  defined  in  (3.8).  As  a  check  that  F  is  in  r  note 
that  CTF  =  CTC(CTC)"15  =5  . 

We  wish  to  show  F  is  orthogonal  to  r  .  Vectors 


parallel  to  the  linear  variety  itself  are  the  vectors  of 
the  parallel  subspace  2  .  Any  vector  Z  in  2 
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is  orthogonal  to  F^C^C)"1?  since  Z  satisfies  CTZ  =  e 
and  so  the  inner  product  FTZ  *  yT  (CTC )“  1CTZ  =  0  . 

This  completes  the  proof  of  Geometrical  Property  3. 

k. 

Note  from  the  above  proof  that  any  vector  of  the  form 
CY  ,  where  Y  is  an  m-vector,  ie  orthogonal  to  the  constraint 
variety  r  . 

Geometrical  properties  1-3  are  illustrated  in  Fig.  4.2. 

The  (nxn)  matrix  appearing  between  brackets  in  the 
algorithm  (3.11)  has  an  interesting  geometrical  interpretation. 

t 

Call  the  matrix  P  . 

P  £  I-C(CTcrV  .  (4.3) 

The  following  definition  appears  in  Luer;berger  [16]:  , 

Definition.  Let  a  vector  W  have  a  unique  representation 
as  the  sum  of  two  vectors,  one  from  subspace  Z  and 
the  other  from  the  subspace  Zj.  perpendicular  to  Z  . 

Thus  let  W  *  w„  +  Wj_  ,  where  W„  e  Z  ,  Wx  e  Zx  .  The 
operator  f  defined  by  fW^Wj,  is  called  the 
projection  operator  onto  Z  . 

In  other  words,  a  projection  operator  acts  as  an 
identity  operator  on  components  in  Z  and  as  a  zero  operator 
on  components  in  Z_,_  . 

Geometrical  Property  4.  P  is  a  projection  operator  k 


onto  Z  . 


30 


Proof  of  Geometrical  Property  4. 

Any  weight  vector  W  can  be  represented  as  k 

W  =  (W-  PW)  +  FW  . 

k 

The  vector  EW  =  [  I-C  (CTC)“  1CT]W  is  in  2  since 

CT(PW)  «  CT[I-C(CTC)“1CT]W  »  [CT- CT]W  =  6  .  The  vector 

(W-  FW)  =  C  (CTC)"" ^CTW  is  in  2X  .  This  is  true  because  by 

T 

definition  of  2  for  all  vectors  Ze2  ,  C  Z*0  ;  then 
every  vector  Z  in  2  is  orthogonal  to  (W-  EW)  since 
ZTC(CTC)  *CTW  =  0T  (CTC)  ^CTW  *  9  .  Therefore  we  may  make  the 
identifications:  (W-  IW)  =  Wx  €  2j_  ;  EW*W(|  e2  *  and 

W  =  Wn  +  ,  where  W„  and  Wx  satisfy  the  terms  of  the 

definition.  By  the  second  identification,  P  is  the 
projection  operator  onto  2  . 

This  completes  the  proof  of  Geometrical  Property  4. 

The  geometrical  interpretation  of  P  is  shown  in 
Fig.  4.3. 

The  algorithm  (3.11)  may  be  rewritten  in  terms  of  the 
projection  operator: 

W (k+1)  =  P[w(k)  +  M-e  (k)X  (k)  ]  +  F  .  (4.4) 

It  should  be  mentioned  that  the  vector  -e  (k)X  (k)  is  an 
estimate  of  the  unconstrained  gradient  VwJ  .  The  uncon¬ 
strained  gradient,  given  in  (3.4),  is  RxXW^”RXd  • 

Replacing  by  X(k)XT(k)  and  by  X(k)d(k) 

results  in  X  (k)X^  (k)W  (k)  -  X  (k)d  (k)  * -e  (k)X  (k)  ,  where 
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T 

e(k)=d(k)-X  (k)W(k)  .  The  algorithm  is  now  considered 
as  a  whole. 

The  algorithm  attempts  to  minimize  the  cost  function 
E([d  (k)  -  WTX(k)  ]^}  by  iterating  to  the  optimal  weight 
vector  W#  along  the  constraint.  Figure  4.4  shows  the 
position  of  a  hypothetical  adaptive  weight  vector  at  iteration 
k  and  the  position  of  the  optimal  weight  vector. 


Position  of  the  adaptive  weight  vector  W(k)  at  the 
kth  iteration  and  the  optimal  constrained  weight 
vector  W*  . 

W,  =  [  I  -  P^C  (CTR^C )-  1CTE^1Hxd  +  (cV'c  f  \  . 


Fig.  4.4. 
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The  operation  of  the  Constrained- IMS  algorithm  (4.4) 
is  shown  in  Fig.  4.5.  In  this  example,  the  unconstrained 
negative  gradient  estimate  e(k)X(k)  is  scaled  by  H  and 
added  to  the  current  weight  vector.  The  resulting  vector 
is  projected  onto  the  subspace  E  ,  producing  a  vector 
parallel  to  the  constraint  variety  A  .  This  vector  is 
translated  out  to  the  constraint  surface  by  adding  it  to 
F  ,  forming  the  new  weight  vector  W(k+1)  satisfying  the 
constraint. 


Fig.  4.5.  Operation  of  the  Constrained- IMS  algorithm. 
W (k+ 1 )  =  P[W(k)  +  ne(k)X(k)]  +  F  . 

It  is  now  shown  that  any  difference  vector  between  two 
vectors  satisfying  the  constraint  must  lie  in  2  (see  Fig. 
4.6).  An  identity  that  will  be  useful  in  the  next  section 
is  given. 
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Geometrical  Property  5.  Let  and  W2  be  in  r  and  let 

their  difference  be  .  Then  V  is  in  the 

subspace  2  and  PV  *  V  . 

i 

Proof  of  Geometrical  Property  5. 

Since  and  W2  are  in  r  ,  CTv  *  C^Vf^  -  CTW2  *  5-5  =  Q  , 

so  V  is  in  2  .  By  definition  of  a  projection  operator, 
if  Ve2  then  PV « V  .  Algebraically, 

PV»  [I  -  C(CTC)-1CT]V  -  V  +  9  -  V  . 

This  completes  the  proof  of  Geometrical  Property  5. 

Note  also  that  P  ia  symmetric  and  idempotent,  i.e„, 

» 

PT  -  P  ,  (4.5) 

and  -< 

P2  -  P  .  (4.6) 

These  are  verified  by  carrying  out  the  operations.  The 
idempotence  relation  (4.6)  for  a  matrix  that  is,  in  general, 
neither  the  zero  nor  the  identity  operator  is  interesting 
because  it  is  impossible  in  the  scalar  case.  It  is  a  result 
of  the  fact  (not  proven  here)  that  P  has  only  zero  and 
unity  eigenvalues. 


V.  PERFORMANCE 


In  Part  A  of  this  section  it  is  shown  that  the  mean 
adaptive  Constrained- IMS  weight  vector  converges  to  the  optimal 
constrained  weight  vectc.:  of  Theorem  1.  Rates  of  conver¬ 
gence  along  the  eigenvectors  of  the  matrix  PRxxp  are  given. 

In  part  B  it  is  shown  that  the  difference  in  steady- state 
performance  between  the  algorithm  and  the  optimal  estimator 
can  be  made  arbitrarily  small  by  decreasing  the  adaptive 
gain  constant  M-  . 


A.  Convergence  in  Mean  to  the  Optimum  and  Rate  of  Convergence 


The  algorithm  (4.4)  is  repeated  here  in  a  more  convenient 


form* 


W(k+1)  =  P[W(k)  -  MX(k)X  (k)W(k)  +  MX(k)d(k)]  +  F  .  (5.1) 


Note  that  the  weight  vector  W(k)  is  a  function  of  W(0)  , 

(X  (k- 1)  ,X  (k- 2) , . . . ,X (0) }  and  (d  (k-1)  ,d  (k-2) , . . .  ,d  (0) )  . 

It  was  assumed  at  the  beginning  of  Section  III  that  the 
observation  vectors  X  are  independent^  so  X(k)  is  inde¬ 
pendent  of  W(k)  .  Taking  the  expected  value  of  both  sides 
of  (5.1)  we  have  an  iterative  equation  in  the  mean  value  of 
the  Constrained- IMS  weight  vector 

EW  (k+1)  =  P[EW(k)  -  PLRxxEW(k)  +  ^Rxd]  +  F  .  (5.2) 

^This  is  believed  to  be  an  overly- restrictive  assumption  but 
greatly  simplifies  the  analysis.  For  a  special  case  of  the 
algorithm  (no  constraints),  Daniell  [6]  has  shown  e- convergence 
assuming  that  the  X's  are  only  asymptotically  independent. 
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Convergence  of  the  mean  is  easily  established  using 
identities  for  expressing  F  and  Rxd  in  terms  of  the 
optimal  weight  vector: 

F  =  [I-  P]W*  ,  (5.3) 

**Xd  “  PRXXW*  *  (5.4) 

both  of  which  are  verified  directly  using  (2.4),  (4.3),  and 
(3.8).  Let  V(k+1)  be  the  difference  between  the  mean 
adaptive  weight  vector  at  iteration  k+1  and  the  optimal 
weight  vector: 

V (k+1)  ^EW(k+l)-W*  .  (5.5) 


From  (5.2)- (5.5)  an  equation  for  the  difference  process 
may  be  constructed: 

V  (k+1 )  =  P[EW  (k )  —  W^EW  (k )  +  ^PR^W*  ]  +  [  I  -  P]W*  -  W* 

*  PV(k)  -  ^PR^Vfk)  .  (5.6) 

T 

Using  PV  =  (VP)  *V  from  Geometrical  Property  5  and  (4.5) 
obtain 

V (k+1)  =  [I-  M'PR^PjV (k) 

=  [I  -  UPR^P^+Mo)  .  (5.7) 

The  matrix  PR^P  is  the  correlation  matrix  of  projected 
observations,  i.e.,  E(  (PX)  (PX)T)  .  The  non-zero  eigenvalues 


of  this  matrix  are  extremely  important  in  determining  both 


n 


the  convergence  rate  of  the  algorithm  and  its  steady- state 
performance  relative  to  the  optimum.  The  matrix  being 
(nxn)  and  symmetric  is  diagonalizable  into  n  orthogonal 
eigenvectors.  It  is  shown  in  Appendix  C  that  m  of  the 
eigenvectors  of  PR^P  lie  entirely  outside  the  subspace 
2  and  have  zero  eigenvalues;  the  other  (n-m)  eigenvectors 
lie  entirely  within  2  and  have  strictly  non- zero  eigen¬ 
values.  All  of  the  "action"  is  in  the  subspace  2  . 

Call  the  (n-m)  non- zero  eigenvalues  of  PR^P 
c^,  i=l,  2,  . . . ,  (n-m)  ,  and  call  the  n  (non-zero)  eigenvalues 
of  R^  A^, i=l, 2, . . .  ,n  .  To  get  a  feeling  for 
the  relationship  between  the  o's  and  the  A's  ,  it  is 
proven  in  Appendix  C  that  the  non- zero  eigenvalues  of 
PR^P  all  fall  between  the  largest  and  smallest  eigen¬ 
values  of  ,  that  is,  for  1  i  <  (n-m) 


A  .  <o.  <  a.  <  o  <  A  ,  (5.8) 

min  —  mm  —  i  —  max  —  max  '  v 

where  the  subscripts  min  and  max  denote  respectively 
the  smallest  and  largest  members  of  a  set. 

Since  V(0)  is  the  difference  between  two  vectors 
satisfying  the  constraint  (5.5),  from  Geometrical  Property  5 
V(0)  lies  entirely  within  the  subspace  2  and  may 
therefore  be  expressed  as  a  linear  combination  of  eigen¬ 
vectors  of  PRxxP  corresP°ndmg  to  nornzero  eigenvalues. 

If  V(0)  is  equal  to  an  eigenvector  of  PR^P  ,  e..  with 
eigenvalue  then 


(5.9) 


V(k+1)  =  [I-  HPRxxP]k+1ei 
=  [1-  Hoi]k+1ei  . 

Thus  the  convergence  along  any  eigenvector  of  pr^p  is 
geometric  with  geometric  ratio  [1-  and  associated 

time  constant 

»-l/ln(l-  M^)  =  l/Uoi  ,  (5.10) 

where  the  approximation  is  valid  for  «  1  .  it  is  clear 

then  that  if  4  is  chosen  so  that 

0  <  <  ^max  '  <5'n> 

then  the  euclidean  norm  of  the  difference  vector  is  bounded 
between  two  ever- decreasing  geometric  progressions 

[l-  %ax]lc',‘1llv(0)ll  ^  II v (k+i) ||  £  [l-  ^min]k+1||v(0)|| 

(5.12) 

and  the  expected  value  of  the  weight  vector  converges  to  the 
optimum  with  time  constants  given  by  (5.10)  if  the  initial 
difference  is  finite. 

We  emphasize  that  convergence  of  the  mean  shown  here 
is 

lim  ||EW(k)-wJ|  -  0  .  (5.13) 

k-»°° 


■  . . .  - 
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B.  Steady- state  Performance  Compared  to  Optimum 

In  this  subsection  the  performance  of  the  Constrained- IMS 
algorithm  is  compared  with  the  optimum  of  Theorem  1  after 
Juransients  have  become  negligible. 

To  allow  the  Constrained- IMS  algorithm  to  operate  in 
qual^- stationary  (i.e.,  slowly  time- varying)  environments, 
the  adaptive  gain  a  remains  constant  during  the  application 
of  the  algorithm.  (In  stochastic  approximation  schemes  the 
gain  is  usually  allowed  to  go  to  zero  as  time  passes.)  As 
a  result  of  continually  adapting,  the  weight  vector  has  a 
non- zero  variance  about  its  optimal  value.  In  a  stationary 
noise  field,  the  effect  of  variations  about  the  optimum 
weight  vector  is  to  add  a  slight  additional  cost  in  excess 
of  that  achievable  by  the  optimum.  (See  Brown  [2]  for 
results  on  time- varying  noise  fields.) 

The  excess  cost  normalized  by  the  optimum  cost  level 
is  a  dimensionless  quantity  called  "misadjustment"  by 
Widrow  [28]  and  is  a  measure  of  how  closely  the  algorithm's 
performance  achieves  the  optimal  performance.  Steady- state 
misadjustment  is 


Mfli) 


=  lim 
k-»0 


Cost  of 

Adaptive  Filter 
_  at  time  k  _ 

- 

Cost  of  1 

Optimal  Filterl 

r* - i 

Cost  of 
Optimal  Filter 
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For  the  constrained  least- mean- squares  problem  of  Theorem  1 
the  steady- state  misadjustment  is 

EffdOc)  -  WT(k)X(k)]2}  -  E{[d(k)- W*X(k)]2} 

M(n)  =  lim - = - = -  (5.14) 

k-0  E  { [d  (k )  -  W^X  (k )  ]  j 

Under  the  assumptions  that  d(k)  and  the  components  of 
X(k)  are  jointly  Gaussian-distributed  and  independent  from 
observation  to  observation  it  is  possible  to  calculate  very 
tight  bounds  on  M(M-)  by  a  method  due  to  Moschner  [20]  . 

For  an  adaptive  gain  constant  satisfying 

1 

0  <  n  <  -  ,  (5.15) 

°max  +  a/ajTrfPR^Pi 

it  is  shown  in  Appendix  B  that  steady- state  misadjustment 
may  be  bounded  by 

2  U  - TT 

1-2fTr(PRXXP)  +  20min]  1 "  2  ^Tr  (PRXXP)  +  2°max^ 

(5 . 16) 

M  (U- )  can  be  made  arbitrarily  close  to  zero  by  suitably 
small  choice  of  gain  constant  M-  ;  this  means  that  the  steady- 
state  performance  of  the  Constrained- IMS  algorithm  can  be 
made  arbitrarily  close  to  the  optimum.  From  (5.10)  it  is 
seen  that  such  cost  performance  is  obtained  at  the  expense 
of  increased  convergence  time. 


TrfPR^P) 


VI .  APPLICATIONS 


In  this  section  adaptive  solutions  are  given  to  the 
problems  defined  in  Theorem  1  and  its  corollaries.  At  the 
same  time  the  performance  of  each  adaptive  algorithm  is 
given.  The  important  application  to  array  processing  is  the 
main  example  of  this  section. 

The  results  of  the  preceding  section  are  summarized 
in  a  companion  theorem  to  Theorem  1: 

Theorem  2.  (Adaptive  Constrained  Least- Mean- Squares  Optimi¬ 
zation)  Let  {d  (Tc ) )  be  a  sequence  of  random  variables 
and  f X  (1c ) }  be  a  sequence  of  n- dimensional  data  vectors 
of  observed  random  variables.  Each  vector  X(k)  is 
assumed  to  be  produced  independently  by  an  unknown 
ergodic  source  with  unknown  correlation  matrices 

E  (X  (k  )XT  (k) }  =  (nxn) 

E  [X  (k)d  (k) }  =  (n  x  1) 

and  positive  definite.  The  algorithm 

W (k+1)  =  P[W  (k)  +  M-e  (k)X  (k) J  +  F  ,  (6.1) 

where 

rp  —IT 

P=  [I  -  C(CXC)  C  ]  , 

F  =  C  (CTC)_1  7  , 

CTW  (0 )  =  7 

and 


-41- 
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e(k)  =  d(k)  -  WT(k)X(k)  , 


converges  in  the  mean  to  the  optimum  weight  vector  W* 
solving  the  constrained  IMS  problem  defined  in 
Theorem  1: 


minimize  E { [  (d  (k)  -  WTX  (k)]  2 ) 
subject  to  CTW  =  , 


(6.2) 


if 

0  <  M-  <  - -  .  (6.3) 

°max  +  2  Tr'PRXXP> 

Further, 


1 


i)  the  convergence  time  constant  of  the  difference 
between  EW(k)  and  W*  along  the  i*"*1  eigen¬ 
vector  of  PRxxP 


-1 

Ti  “  In (1  -  40i ) 


l/Uoi 


(6.4) 


where  ck  is  the  eigenvalue  corresponding  to 
the  i^  eigenvector  of  PR^P  * 

ii)  Under  the  additional  assumption  that  variables 
d(k)  and  X(k)  are  jointly  Gaussian 
distributed,  the  steady- state  misad justment  of 
the  adaptive  solution  can  be  bounded  by  (5.16). 


Proof  of  Theorem  2.  (See  Section  V) 

This  completes  the  proof  of  Theorem  2. 
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Example  1.  (Consistent  Modeling) 

A  single- input,  single-output  system  is  to  be  modeled 
with  a  tapped- delay- line  filter.  It  is  known  that  the  system's 
steady- state  response  to  a  unit  step- function  input  is  a 
particular  number  a  ,  and  this  feature  is  to  be  incorporated 
into  the  model. 


u  (♦) 


UNKNOWN 

d(t) 

SYSTEM 

:c 

„/t\.uiT< 

e(t) 


TOL  FILTER 


Fig.  6.1.  Tapped- delay- line  filter  modeling  a  system. 


Let  the  input  to  the  system  and  model  be  a  random 
variable  u (t)  .  The  model  is  a  tapped-delay- line  filter 
with  n  tap  points  and  a  delay  of  A  seconds  between  each 
tap.  Inputs  pass  down  the  tapped-delay- line  (TDL)  filter. 

Let  the  states  at  each  tap  point  be  denoted  (t),  i=l,  2,  . . . ,  n  . 
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Thus  the  first  state  is  equal  to  the  input,  x^t)  =u(t)  , 
the  second  state  is  equal  to  the  input  delayed  by  A  seconds, 
x2(t)  =u(t-  A)  ,  and  so  forth.  The  output  y(t)  of  the 
TDL  filter  is  a  weighted  sum  of  the  states.  Let  the  weight 

on  the  i^  state  be  w^  and  form  the  n- dimensional  vectors 

T  T 

of  weights  W  »  (w^w^ — ,wn)  and  states  X  (t)  = 

(Xj.  (t),x  (t), ..  .,xn(t))  .  The  output  of  the  TDL  filter  at 

time  t  is  then  y(t)  *WTX(t)  .  The  desired  output  of 

the  model  is  the  output  of  the  unknown  system  d(t)  .  The 

error  is  the  difference  between  the  desired  output  and  the 

actual  output:  e  (t )  =  d  (t )  -  y  (t )  . 

The  constraint  is  now  included.  If  the  systems  are  given 

a  unit  step  input  (i.e.,  u(t)=l),  then  after  nA  seconds 

T  T 

the  TDL  filter  will  be  in  steady  state,  with  X  (t)  =J,  = 
(1,1,..., 1)  .  Thus  the  constraint  that  the  steady-state 
response  of  the  TDL  filter  be  a  a  is  equivalent  to  requiring 

1TW  =  a  .  (6.5) 

The  consistent  modeling  problem  is  therefore 

2 

minimize  E{e  (t)} 

<P  (6.6) 

subject  to  ,1  W  =  a  . 

To  apply  the  Constrained  I^IS  algorithm,  the  state  vector  and 

‘til 

error  are  sampled  at  intervals  of  T  seconds.  At  the  k 
sampling  instant  the  state  vector  is  X(kT)  and  the  error  is 
e(kT)  .  The  time  between  samples  T  is  made  large  enough 
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so  that  X(kT)  is  essentially  independent  of  X(jT)  for 
j  /k  .  (As  noted  in  Section  V  this  is  not  believed  to  be 
absolutely  necessary).  The  algorithm  is  therefore 


W (k+1)  =  P[W (k)  +  |ie  (kT)X  (kT)]  +a  (6.7) 

where 

p  «  i-idV1!1  -  * 

and  the  weight  vector  is  assumed  to  be  constant  for  t  in 
kT  ^  t  <  (k+l)T  . 

A  practical  matter  arises  here.  It  may  be  difficult 
to  calculate  the  permissible  upper  bound  on  p.  given  by 
(6.3),  especially  if  the  autocorrelation  matrix  R^  is 
not  known.  An  easily  measured  quantity  guaranteed  to  be 
no  higher  than  the  permissible  upper  bound  is 


1 

I  Tr<Rxx>  ' 


(6.8) 


That  is,  if  M-  is  chosen  to  satisfy 


0  <  \i  <  n0  ,  (6.9) 

then  it  is  guaranteed  to  satisfy  (6.3).  Observe  that  |iQ 
can  be  calculated  directly  and  easily  from  observations 
since  Tr  (R^)  =E[X  (k)X(k)]  ,  the  sum  of  the  powers  of  the 


states . 


A  special  case  of  the  algorithm  given  in  Theorem  2  is 
the  celebrated  IMS  algorithm.  The  following  is  a  companion 


to  Corollary  1.1. 

Corollary  2.1.  (Adaptive  Least- Mean- Squares  Optimization  — 
Widrow  and  Hoff)  Let  the  sequences  fd(k))  ,  (X(k)} 
and  their  (unknown)  correlation  matrices  be  defined  as 
in  Theorem  2.  The  algorithm 

W (k+1)  =  W(k)  +  He  (k)X(k)  (6.10) 

where 

e  (k)  •■=  d(k)  -  WT(k)X(k)  . 

converges  in  the  mean  to  the  optimum  weight  vector 
solving  the  unconstrained  least- mean- squares 
optimization  problem  defined  in  Corollary  1.1: 

2 

minimize  E(e  (k)}  ,  (6.11) 


if 


0  <  u  < 


1 


max 


Tr<*xx> 


(6.12) 


Further, 


i)  the  convergence  time  constant  of  the  difference 
between  EW(k)  and  along  the  ith 

eigenvector  of  is 


47 


where  A .  is  the  eigenvalue  corresponding  to 
til 

the  i  eigenvector  of  . 

ii)  Under  the  additional  assumption  that  variables 

d(k)  and  X(k)  are  jointly  Gaussian  distributed, 
the  steady  state  misadjustment  of  the  adaptive 
solution  can  be  bounded  by 

a  Tr(Rxx}  u  Tr{Rxx) 

1 "  "2  tTr  (RXX)  +  2\iin^  1-|fTr<RXX,  +  2W 

(6.14) 


/ 


Proof  of  Corollary  2.1. 

The  projection  operator  P  of  Theorem  2  goes  to  the 
identity  when  all  constraints  are  removed.  The  vector  F 
vanishes . 

This  completes  the  proof  of  Corollary  2.1. 

Uses  of  linear  least- squares  algorithms  are  abundant 
and  well-known  so  no  examples  will  be  given. 

The  next  corollary  is  a  companion  to  Corollary  1.2. 

Corollary  2.2.  (Adaptive  Least- Mean- Squares  Distortionless 

Estimate)  Let  the  sequences  (d(k)}  ,  (X(k))  ,  and  their 
(unki  own)  correlation  matrices  be  defined  as  in 
Theorem  2.  Further,  let  each  X(k)  be  of  the  form 
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X  (k)  =  CB  (k)  +  N  (k)  , 


(6.15) 


where  C  is  a  known  (nxm)  matrix  with  n  >m  and  [B(k)} 
is  a  sequence  of  unknown  m-dimensional  vectors.  Each 
B(k)  may  be  a  vector  of  random  variables  with  unknown 
mean  (so  E(b)  =  B)  ,  or  it  may  be  a  vector  of  unknown 
parameters,  in  which  case  E(B)*B  =  B.  (N  (k ) }  is  a 
sequence  of  unknown  r>- dimensional  zero- mean  random 
vectors  considered  as  noise.  B(k)  and  N(k)  are 
assumed  uncorrelated.  Let 


The  algorithm 


« 


W  (k+1 )  =  P[W(k)-M-y(k)X(k)]  +  F  (6.17) 

where 

T  -IT 

p  =  [i-  c (crc)  Acrj 

T  - 1 

F  =  C(CAC)  ■Lff 

CTW(0)  =  V 

y(k)  =  wT(k)x(k)  , 

converges  in  the  mean  to  the  optimum  weight  vector  * 

W  solving  the  least-mean- squares  distortionless 
*2 

estimation  problem  defined  in  Corollary  1.2: 
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minimize  E  { [b .  -  WTX  (k)  ]  2 

_  m  (6.18) 

subject  to  E  {  (b^  -  W  X  (k) )  =  0  , 

as  long  as  p.  satisfies  condition  (6.3)  of  Theorem  2. 

The  convergence  rates  and  misadjustment  are  the 
same  as  those  of  Eqs.  (6.4)  and  (5.16)  of  Theorem  2. 


Proof  of  Corollary  2.2. 

It  was  shown  in  Corollary  1.2  that  the  problem  (6.18) 
may  be  reformulated  as 


minimize"  E{[WTX(k)]2} 
T 

subject  to  C  W  *  J  . 


(6.19) 


This  is  just  the  problem  of  Theorem  2  with  d(k)  =0  . 
Accordingly  in  the  Constrained- IMS  algorithm  (6.1),  d(k) 
is  set  to  zero  and  the  corollary  follows .  The  requirements 
on  p.  and  performance  are  unchanged. 

This  completes  the  proof  of  Corollary  2.2. 

Example  2.  (State  Estimation  under  Uncertainty) 

A  system  is  described  by  the  equations 


B  (k+1)  =  $B(k)  +  ru(k) 

X  (k)  =  C  B  (k)  +  N  (k)  , 


(6.20) 


where  C 

is  a  known 

(n  x  m) 

matrix  with 

n  >  m  .  B  (k) 

is 

the  state 

vector  and 

U(k) 

is  the  input 

vector.  N(k) 

is 

zero- mean 

measurement 

noise 

.  The  matrices  <t>  and  r 

are 
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not  known;  the  statistics  of  U(k)  and  N(k)  are  not  known. 
We  wish  to  estimate  a  component  bi(k)  of  the  state  vector. 
The  algorithm  (6.17)  converges  to  the  best  constant  linear  least 
squares  unbiased  estimator  of  a  component  of  B(k)  .  If 
an  estimate  of  the  entire  vector  is  desired,  m  algorithms 
like  (6.1)  may  be  used,  with  the  unit  entry  in  a  different 
place  in  each  f  vector. 


Note  that  the  amount  of  knowledge  required  for  an 
estimate  of  B(k)  using  the  constrained  IMS  algorithm  is 
a  small  fraction  of  that  required  by  the  Kalman  filter. 

(The  Kalman  filter  is  the  optimum  unconstrained  time- varying 
linear  least- squares  estimator  for  the  state  vector  of  the 
dynamic  system  (6.20),  and  requires  that  all  system  matrices 
and  correlation  matrices  be  known.) 

The  next  corollary  is  a  companion  to  Corollary  1.3  . 

Corollary  2.3.  (Adaptive  Leasb-Mean-Squares  Filtered  Estimate) 
Let  the  sequence  (d(k))  ,  (X(k)}  ,  and  their  (unknown) 
correlation  matrices  be  defined  as  in  Theorem  2.  Let 

X (k)  «CB(k)+N(k),  (6.21) 

B(k)  and  N(k)  be  vectors  of  random  variables  as  in 
Corollary  1.3.  Further,  let  B(k)  be  written 

B(k)  «  A(k)  +  A (k)  ,  (6.22) 

where  M(k)}  and  (A(k))  are  sequences  of  ro- dimensional 
vectors  of  random  variables.  We  wish  to  estimate  the 


V 


y 
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ith  component  of  4(k)  ,  si(k)  .  A  filter  vector  ? 
is  given,  which  may  be  based  on  as  much  or  as  little 
information  about  the  statistics  of  4(k)  and  A(k) 
as  is  known. 

The  algorithm 

W  (k+1)  =  P[W(k)  -  Hy  (k)X(k)]  +F  ,  (6.23) 

where 

T  -IT 
P  *  [I-  C  (CXC)  ■LCA] 

T  - 1 

F  =  C(C  C)  7 

CTW(0)  =  7 

y(k)  =  WT(k)X (k)  , 

converges  in  the  mean  to  the  optimum  weight  vector  W#3 
solving  the  least- mean- squares  filtering  problem  given 
in  Corollary  1.3: 

minimize  E  ( (s .  (k)  -  WTX  (k)J  2  } 

T  <6*24) 
subject  to  C  W  =  5  , 

as  long  as  M-  satisfies  condition  (6.3)  of  Theorem  2. 

The  convergence  rates  and  misadjustment  are  the 
same  as  those  of  Eqs.  (6.4)  and  (5.16)  of  Theorem  2. 

Proof  of  Corollary  2.3. 

In  Corollary  1.3,  it  is  shown  that  the  problem  (6.24) 


may  be  reformulated  as 
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minimize  E{[WTX(k)]^} 

subject  to  CTW  =  l.i  .  (6.25) 

As  before,  this  is  just  the  problem  of  Theorem  2  with 

« 

d(k)  =0  .  The  results  follow  from  Theorem  2. 

This  completes  the  proof  of  Corollary  2.3. 

The  next  example  is  the  major  example  of  the  paper,  and 
is  one  of  the  main  reasons  for  an  interest  in  adaptive 
constrained  least  squares  optimization:  It  is  shown  in  this 
example  that  adaptive  constrained  ms  optimization  makes  ” 

possible  the  near-optimum  processing  of  data  from  an  array 
of  antennas  or  other  sensors  with  very  little  a  priori 
information  about  the  signals  and  noises  involved.  In 

* 

contrast,  known  adaptive  processors  converging  to  the  optimal 
unconstrained  least -mean- squares  filter  [12]  require  know¬ 
ledge  of  either  the  signal  or  noise  statistics. 

Example  3 .  (The  Array  Processor) 

In  most  applications  involving  arrays  of  sensors  — 
notably  sonar,  seismology,  radio  communication,  and  radio 
astronomy  using  antenna  arrays —  it  is  desirable  to  reduce 
antenna  sensitivity  to  unwanted  signals  and  noises  while 
processing  the  signals  of  interest  in  real  time.  For 
example,  arrays  of  sonar  hydrophones  provide  information 
about  the  undersea  environment;  it  may  be  desirable  to  listen 

* 

to  signals  coming  from  a  particular  direction  and  simul¬ 
taneously  avoid  hearing  the  noise  of  the  sonar  ship's  own 
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machinery  and  screws  [22]  .  In  geology,  sub-arrays  of 
seismometers  are  being  used  in  the  large- aperture  seismic 
array  in  Montana  (4)  to  listen  to  seismic  events;  such  arrays 
must  discriminate  against  noises  emanating  from  surrounding 
cfties.  In  radio  communications  using  antenna  arrays,  it 
is  desirable  to  receive  signals  from  one  direction  while 
ignoring  the  signals  from  amateur  radio  operators  and  other 
electromagnetic  noises  [5]  .  Radio  astronomers  using  antenna 
arrays  want  to  look  in  one  part  of  the  sky  while  discrimi¬ 
nating  against  other  radiation  sources  impinging  on  antenna 
sidelobes  [25,  p.  33] .  In  all  these  applications,  it  is 
desirable  to  have  a  processor  that  can  discriminate  against 

unwanted  noises  in  real  time  and  that  requires  a  minimum  of 

*. 

^  priori  information. 

An  array  processor  is  a  filter  both,  in  frequency  and  in 
space.  A  typical  processor  configuration  is  shown  in  Fig. 

6.2.  The  array  has  K  sensors  with  a  tapped  delay  line 
following  each  sensor.  Each  line  has  L  tap  points  and 
'  delays  of  A  seconds  between  taps.  Signals  and  noises 
impinging  on  the  array  are  converted  to  voltages  which  pass 
down  the  tapped  delay  lines  and  the  weighted  sum  of  these 
voltages  is  the  output  of  the  filter.  By  proper  choice 
of  weights,  the  array  processor  can  discriminate  against 
unwanted  noises  distributed  both  in  frequency  and  space. 

The  filter  separates  desirable  signals  from  other  noises. 
In  this  example,  to  be  "desirable"  a  signal  must  come  from 


a  particular  chosen  direction  in  space,  called  the  "look 
direction".  All  signals  coming  from  other  directions,  plus 
any  measurement  or  amplifier  noise,  are  termed  "undesirable 
noise".  But  not  all  signals  coming  from  the  look  direction 
are  desirable;  some  noise  comes  from  the  look  direction  and 
is  called  "look  direction  noise". 

The  signal  is  modeled  as  a  zero-mean  random  process 
emanating  from  the  look  direction  in  the  far  field  of  the 
array.  It  is  assumed  that  the  propagating  medium  is  linear, 
non- dispersive,  and  that  propagation  times  along  the  signal 
phase- front  are  well  enough  known  that  the  array  can  be 
steered,  electrically  or  mechanically,  in  the  direction  of 
the  signal.  Sources  in  the  look  direction,  i.e.,  desired 
signal  and  look  direction  noise,  are  assumed  to  be  statis¬ 
tically  uncorrelated  with  noises  emanating  from  other 
directions.  (This  rules  out  multipath.)  Finally,  all  the 
sensors  are  assumed  to  have  identical  characteristics  (but 
are  not  necessarily  omnidirectional). 

It  should  be  mentioned  that  sonar  and  seismic  signals 
are  generally  low  frequency  (audio  or  lower)  and  may  be 
processed  in  real  time  using  the  adaptive  algorithm  imple¬ 
mented  by  present-day  hardware  (13J  .  In  radio- frequency 
applications,  however,  the  signals  must  first  be  demodulated. 

As  in  Example  1,  let  the  observations  at  each  tap  points 
at  time  t  be  denoted  (t) ,  i=l,  2, . . . ,n  .  In  this  case 
n  =  KL  ,  the  number  of  sensors  times  the  number  of  taps  per 
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sensor.  Let  the  weight  on  the  observation  be  w.  and 

1 

T 

form  the  n-dimensional  vectors  of  weights  W  =  (wi'w2* *  *  * ,wn) 

T 

and  observations  X  (t)  ■  (x1 (t) f x2 (t) # . . . , x  (t) )  .  The  output 

T 

of  the  array  processor  at  time  t  is  then  y(t)  *W  X(t)  . 

Let  the  signals  arriving  "in  the  beam"  (i.e.,  from  the  look 
direction)  at  time  t  be  denoted  b(t)  .  Because  the 
array  is  steered  toward  the  look  direction,  signals  arriving 
"in  the  beam"  enter  each  of  the  K  tapped-delay  lines 
simultaneously  and  parade  in  parallel  down  the  lines  (see 
Fig.  6.2).  All  K  taps  on  the  first  column  of  taps  have 
the  same  beam  component,  b(t)  ,  and  a  different  undesirable 
noise  component  (t) , i»l, . . . ,K  from  noises  entering  from 
other  directions  and  amplifier  noise;  every  tap  on  the 
second  column  of  taps  has  the  same  beam  component  b(t-  A) 
and  a  different  noise  component  *1^  (t) , i»K+l, . . . , 2K.  ,  and 
so  forth.  Forming  the  L- dimensional  vector  of  beam  signals 
on  each  column  at  time  t  ,  B  (t)» [b  (t)  ,b  (t- A) , . . .  ,b  (t-  (L-  1)A)] 
and  the  n-dimensional  vector  of  undesirable  noises  on  each 
tap  at  time  t  ,  NT(t)«[,H1  (t),^  (t),  . .  .,T)n  (t ) ]  it  is  seen 

from  Fig.  6.2  that  the  vector  of  observations  may  be  written 

X(t)  =  C  B(t)  +N(t)  ,  (6.26) 


where 


1 


0 


0 


0 


A 


c  = 


0 


0 

1 

• 

1 


0 

0 


KL  (6.27) 


1 


0 


0 


1 


T 


Due  to  the  array  steering  it  is  particularly  easy  to 
specify  the  frequency  response  of  the  processor  in  the 
look  direction,  since  all  K  taps  in  every  vertical  column 
of  taps  have  identical  beam  components:  The  processor  output 
is  formed  by  a  weighted  sum  of  the  observations;  therefore, 
as  far  as  the  beam  signal  is  concerned,  the  total  weighting 
it  receives  at  each  vertical  column  is  the  sum  of  the  weights 
in  that  column  (see  Fig.  6.3).  For  the  beam  signal,  the 
multichannel  processor  could  be  replaced  by  a  single- channel 
filter.  Each  weight  on  the  single- channel  filter  would  be 
equal  to  the  sum  of  weights  on  the  corresponding  column  of 
the  multichannel  filter. 

Let  the  ifc^  weight  on  the  beam- signal- equivalent  filter 
in  Fig.  6.3  be  fi  .  Let  the  L-dimensional  vector  of  filter 
weights  be 


. ..  .  . 
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(6.28) 


Each  weight  is  the  sum  of  the  weights  in  the  ifctl  column 

of  the  multichannel  processor.  Referring  to  the  definition 
of  C  above,  it  is  seen  that  this  statement  is  equivalent  to 

?  =  CTW  .  (6.29) 


5  determines  the  transfer  function  of  the  processor  in 
the  look  direction.  If,  for  example. 


(6.30) 


then  all  frequencies  of  signals  arriving  from  the  look 
direction  in  plane  waves  would  be  passed  equally  without 
attenuation  (flat  frequency  response).  Changing  any  of 
the  zero  components  would  result  in  a  different  impulse 
response  and  corresponding  frequency  shaping. 

Recall  that  in  the  beam  signal  b(t)  there  is  a  com¬ 
ponent  of  "desired  signal"  s (t)  and  an  additive  "look 
direction  noise"  a(t)  ,  i.e.,  b  (t)  =  s  (t)  +  a  (t)  .  It  is 
assumed  that  from  any  a  priori  information  he  might  have 
about  the  frequency  content  of  s  (t)  or  a (t) ,  the  processor 
user  specifies  the  look  direction  frequency  response  he 
wants  in  the  form  of  the  vector  .  If  he  has  no  prior 
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information  about  tne  desired  signal  then  a  reasonable 

choice  for  ff  is  the  all-pass  filter  (6.30).  Notice  that  * 

specifying  the  look  direction  frequency  response  constrains 
only  L  "degrees  of  freedom"  of  the  n  weights  in  W  . 

The  remaining  "degrees  of  freedom"  are  used  by  the  processor 
to  reduce  the  power  of  undesirable  noises  N(t)  in  the 
output.  Since  the  response  to  the  beam  signals  is  constrained 


and  the  undesirable  noises  are  assumed  uncorrelated  with  the 
beam  signal,  minimizing  total  processor  output  power  is 
exactly  the  same  as  minimizing  noise  output  power. 


The  problem  is 

minimize  E  f [WTX (t )  ]  2 } 
T 

subject  to  C  W  *  5F 


(6.31) 


and  the  algorithm  is  (6.22):  W  (k+1)  *  P[W  (k)  -  M-y  (kT)X  (kT]  +F 
where  T  is  the  time  between  adaptations,  made  sufficiently 
large  so  that  successive  vectors  X  are  essentially  inde¬ 
pendent  . 

In  this  case  P  is  simple  and  sparse  due  to  the 
simple  form  of  the  constraint  matrix  (6.26).  The  matrix 
multiplication  by  P  is  more  simply  regarded  as  a  series  of 


4 


1 


additions  and  scalar  multiplications: 
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1 

1 

K 


0 


0 


0 


0 


0 


0 


0  0 


0 


1 


.1  _1 

K  **•  K 


0 


0 


(6.32) 


1  0 
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1 

K 


1 


0 


0  —i 

K 


1 

K 
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A  computer  simulation  of  the  processor  was  made  using  a 
low- precision  language  (BASIC)  on  a  small  computer  (the  HP 
2116).  The  processor  had  four  sensors  on  a  line  spaced  at 
A  second  intervals  and  had  four  taps  per  sensor  (thus  n=16). 
The  environment  had  three  point  noise  sources,  and  white  noise 
added  to  each  sensor.  Power  of  the  beam  signal  was  quite 
small  in  comparison  to  the  power  of  interfering  noises 
(see  Table  6.1).  The  tap  spacing  defined  a  frequency  of 
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SOURCE 


Beam  Signal 

Noise  A 

Noise  B 

White  Noise 
(per  tap) 


DIRECTION 

POWER  (0°  IS  NORMAL 
TO  ARRAY) 


CENTER 
FREQUENCY 
(1.0  is  1/A) 


BANDWIDTH 


Table  6.1.  Signals  and  noises  in  the  simulation 


1.0  (i.e.,  f  -  1.0  is  a  frequency  of  1/a  Hz.).  In  the  look 
direction,  foldover  frequency  for  the  processor  response  was 
1/2  A,  or  0.5.  All  signals  were  generated  by  a  pseudo- random, 
pseudo- Gaussian  generator  and  passed  through  a  filter  to 
give  them  the  proper  spatia.*  and  temporal  correlations.  All 
temporal  correlations  were  arranged  to  be  identically  zero 
for  time  differences  greater  than  25  A  .  The  time  between 
adaptations  was  assumed  greater  than  58  A  ,  so  successive 
samples  of  X(kT)  were  generated  independently. 

The  look  direction  filter  was  specified  by  the  vector 
hT=.1,-2,1.5,2,  which  resulted  in  a  frequency  characteristic 
shown  in  Fig.  6.4.  The  signal  and  noise  spectra  are  shown 
in  Fig.  6.5  and  their  spatial  position  in  Fig.  6.2. 

In  this  problem,  the  eigenvalues  of  R^  ranged  from 
0.111  to  8.355.  The  upper  permissible  bound  on  the  gain 
constant  u  calculated  by  (6.3)  was  .074;  a  value  of 
U  *  .01  was  selected,  which,  by  (5.16)  would  lead  to  a  . 
misadjustment  of  between  15.2  and  17.0$. 


POWER 
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7 


4 


The  processor  was  initialized  with  W  (0)  =  F  =  c  (CTC)“  1!f  , 
and  Fig.  6.6  shows  performance  as  a  function  of  time.  The 
upper  graph  has  three  horizontal  lines.  The  lower  line  is 
the  output  power  of  the  optimum  weight  vector.  The  closely- 
spaced  upper  two  lines  are  upper  and  lower  bounds  for 
optimum  output  power  plus  misadjustment.  The  mean  value 
of  the  processor's  output  power  falls  somewhere  between  the 
upper  and  lower  bounds.  The  difference  between  the  initial 
and  steady  state  power  levels  is  the  amount  of  undesirable 
noise  power  the  processor  has  been  able  to  remove  from  the 
output. 

Although  the  weight  vector  is,  in  theory,  constrained 
T 

to  satisfy  C  W(k)  *  S  at  all  times,  very  small  deviations 
occur  in  an  actual  implementation  due  to  quantization  and 
computational  errors.  The  lower  graph  in  Fig.  6.6  shows 
the  squared  Euclidean  distance  between  the  weight  vector 

m  o 

and  the  constraint  ||c  W(k)-J||  •  An  error- correcting 

feature  of  the  Constrained- IMS  algorithm  prevents  the 
deviations  from  the  constraint  from  growing. 


■«hir#*s  i». 
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A  last  corollary  deals  with  the  deterministic  constrained 
least  squares  problem. 

Corollary  2.4.  (Gradient-descent,  Deterministic  Constrained 
Least  Squares)  Let  R^  be  a  known  (nxn)  matrix 
and  R^  be  a  known  n- vector.  The  algorithm 

V7(k+1)  *  P[W(k)  -  UR^WOc) +URxd)  +  F  (6.33) 

where 

P  *  (I  -  CtAfVl 

T  - 1 

F  «  C(C  C)  y 

cTw  (0)  «y  , 

converges  deterministically  to  the  solution  W*  of 
the  problem 

T  T 

minimize  (a  -  2W  R^  +  W  R^W] 

subject  to  C^W  «  J  ,  (6.34) 

where  a  is  any  finite  constant,  as  long  as  M- 
satisfies  (6.3).  The  convergence  time  along  eigen¬ 
vectors  of  PRxxP  9*ven  ky  (6*4)  and  there  is  no 
steady- state  misadjustment. 

Proof  of  Corollary  2.4. 

This  algorithm  is  the  same  as  the  recursive  relation 
(5.2)  for  the  mean  weight  vector  of  the  stochastic  constrained 
IMS  algorithm.  Showing  that  the  stochastic  algorithm 
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converges  in  the  mean  is  therefore  the  same  as  showing 
(6.3)  converges.  Convergence  in  the  mean  was  proved  in 
Theorem  2. 

This  completes  the  proof  of  Corollary  2.4. 

Remark:  See  Rosen  [23]  for  an  alternative  solution 
to  this  problem. 


VII.  SENSITIVITY  OF  ALGORITHM  TO  CALCULATION  ERRORS 

The  constrained-  IMS  algorithm  is  related  to  the  gradients 
projection  algorithms  due  to  Rosen  [23] ,  Lacoss  [14] ,  and 
Booker  [1]  .  The  difference  between  the  Constrained- IMS 
algorithm  and  gradient- projection  algorithms  lies  in  the  way 
information  about  the  location  of  the  constraint  surface  is 
carried.  As  Fig.  4.5  showed,  the  Constrained  IMS  algorithm 
(3.11)  "knows"  the  orientation  of  the  constraint  surface  by 
the  matrix  C  ,  and  its  translation  from  the  origin  by  the 
vector  F  .  In  this  section,  it  is  shown  that  gradient- 
projection  algorithms  use  only  the  orientation  matrix  C  ; 
to  ensure  that  the  weight  vector  stays  on  the  constraint 
surface,  they  rely  exclusively^  on  the  fact  that  the  weight 
vector  is  initialized  on  the  constraint  surface  and  always 
moves  parallel  to  it.  The  gradient- projection  method  is 
shown  to  be  sensitive  to  quantization  errors  which  may  cause 
the  weight  vector  to  deviate  from  the  constraint  on  long 
runs . 

Differences  in  the  algorithms  may  be  traced  to  Eq.  (3.5) 
of  the  derivation.  If  CTW(k)  is  replaced  by  S  in  (3.5) 
and  R^  *  0  *  the  gradients  projection  algorithm  of  Booker 
results.  (CTW(k)  should  equal  ^  if  W(k)  exactly  satisfies 

^ Rosen  recognized  this  problem  and  suggested  using  a  second 
algorithm  to  "reset"  the  weight  vector  to  the  constraint 
whenever  errors  became  excessive. 
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the  conscraint.  It  is  shown  in  Fig.  6.6  that  it  may  be 

unreasonable  to  assume  that  W(k)  is  exactly  on  the  constraint 

at  all  times.  In  the  derivation  of  the  Constrained- I/1S 

T 

algorithm,  the  term  C  W(k)  was  carried  instead  of  replacing 
it  by  5  .  Carrying  the  term  corresponds  physically  to 
assuming  that  W(k)  may  not  precisely  satisfy  the  constraint, 
perhaps  due  to  the  quantization  error  of  a  digital  imple¬ 
mentation. 

T 

The  algorithm  that  results  from  replacing  C  W(k)  by 

3  is 

W(k+1)  =W(k)+M-Pe(k )X(k)  ;  CTW(0)  =3  .  (7.1) 

This  is  a  gradient- projection  algorithm.  It  is  so  named 
because  the  unconstrained  gradient  is  projected  onto  the 
constraint  subspace  and  then  added  to  the  current  weight 
vector.  Its  operation  is  shown  in  Fig.  7.1  (compare  with 
Fig.  4.5). 


Fig.  7.1.  Operation  of  the  gradient- projection  algorithm  (7.1). 
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If  somewhere  in  the  computation  an  error  occurs  due 
to  quantization  and  the  weight  vector  is  a  bit  off  the 
constraint  at  time  k  ,  Fig.  7.2  shows  that  the  Constrained- IMS 
algorithm  (3.11)  will  bring  the  weight  vector  back  to  the 
constraint  in  the  next  iteration;  however,  the  gradient- 
projection  algorithm  (7.1)  assumes  that  W(k)  satisfied 
the  constraint  and  adds  a  change  parallel  to  the  constraint 
surface,  continuing  the  error. 

An  algebraic  analysis  is  obtained  by  assuming  that  at 
each  iteration  the  actual  processor  introduces  a  small  vector 
of  errors  £(k)  to  the  weights.  The  update  equations  for 
the  two  algorithms  become 

Constrained- 1/13 : 

from  (3.11):  W (k+1)  «  P[W(k)+ue  (k)X (k)]  +  F  +  £  (k)  (7.2) 

Gradient-  Projection : 

from  (7.1):  W  (k+1)  «  W(k)+UPe  (k)X  (k)  +  £  (k)  ; 

CTW(0)  =  ?  (7.3) 
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Fig.  7.2.  Error  propagation.  The  Constrained- IMS  algorithm  (A) 
corrects  deviations  from  the  constraints  while  the 
gradient— pro jection  algorithm  (B)  allows  them  to 
accumulate . 


where  undefined  products  are  taken  to  be  the  identity.  Now 
noting  that  CTP  =  CT[I-C (CTC)  ^CT]  =  0  and  premultiplying 
by  C  to  see  how  the  weight  satisfies  the  constraint  we 


have 


CTW(k+l)  =  CT[F+£(k)]  =s+CT£(TO  . 


(5.6) 


In  a  perfect  implementation  the  right  side  of  (7.6)  would 
be  !f  .  With  quantization  errors  and  using  the  Constrained- 
I/1S  algorithm,  the  weight  vector  is  off  the  constraint 
only  by  a  term  linear  in  the  last  error  vector. 

Now  an  error  analysis  on  the  gradient- projection 
algorithm  is  made.  Performing  a  backward  iteration  on 
(7.3)  produces 

W(k+1)  =  W(k)-|iP[X(k)XT{k)W(k)  -  X(k)d(k)]  +$(k)  (7.7) 


k  r 

-  *{- 
i=0^ 


PLPX(i)  XT(i)J  W(0) 


k  r  k  j 

+  Y,\  TT  (I-UPX(j)XT(j)]  S  (HPX(i)d  (i)+Ui)l 

i-°  j»i+l  I 


(7.8) 


CTW (k+1)  =  cTW(0)  +  cT  Y.  W  -  5  +  C 


t  ?(i) 


(7.9) 


The  last  term  of  (7.9)  shows  how  the  algorithm  (7.1)  accumu¬ 
lates  deviations  from  the  constraint. 


If  the  computation  errors  are  modeled  as  a  zero- mean 


process  (2  7)  ,  the  gradient- projection  algorithm  does  a 


random  walk,  away  from  the  constraint  with  variance  increasing 
as  the  number  of  iterations  (see  Appendix  D) . 

A  simulation  of  the  gradient  projection  algorithm  on 
the  array  problem  (Example  3)  was  made,  using  exactly  the 
same  data  as  used  by  the  Constrained- IMS  algorithm.  The 
results  are  shown  in  Fig.  7.3.  The  lower  part  of  Fig.  7.3 
shows  how  the  gradient- pro jection  algorithm  walks  away  from 
the  constraint.  Note  the  change  in  scale.  If  the  errors 
of  the  Constrained  IMS  algorithm  (Fig.  6.6)  were  plotted  on 
the  same  scale  they  would  not  be  discernible.  Further,  the 
errors  of  the  gradient- pro  jection  method  are  expected  to 
continue  to  grow. 


CONSTRAINED -LIflS  ALGORITHM 


VIII.  SUMMARY 


A  general  algorithm  was  developed  for  stochastic  linear 
least- squares  optimization  subject  to  linear  equality 
constraints.  The  algorithm  has  three  major  properties: 

First,  it  has  very  modest  computational  requirements; 
second,  it  requires  very  little  a  priori  knowledge;  third, 
it  converges  to  an  optimal  filter.  A  fourth  property  is 
that  the  algorithm  can  operate  continuously  without  wandering 
from  the  constraints. 

Rate  of  convergence  and  steady- state  performance  of  the 
general  algorithm  are  derived.  Special  cases  of  the  algorithm 
are  treated,  with  examples.  An  important  application  of  the 
algorithm  is  the  real-time  processing  of  data  from  an  array 
of  sensors. 
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APPENDIX  A 


DERIVATION  OF  GRIFFITHS'  MIA  ALGORITHM 
BY  THE  QUADRATIC  PENALTY  FUNCTION  METHOD 


The  purpose  of  this  appendix  is  to  show  that  the 
Maximum  Likelihood  Ratio  (MIA)  algorithm  due  to  Griffiths 
[11]  may  be  considered  as  an  algorithm  solving  a  least- 
mean-  squares  problem  subject  to  "soft"  linear  equality 
constraints.  This  gives  a  simpler  derivation  than  the 
original  and  immediately  illuminates  some  properties  of  the 
algorithm  that  are  well-known  general  properties  of  quadratic 
penalty  function  algorithms.  As  a  side  benefit,  a  general 
method  of  generating  adaptive  algorithms,  based  on  the 
quadratic  penalty  function  method,  is  indicated. 

The  quadratic  penalty  function  method  is  a  way  of  turning 
a  constrained  optimization  problem  into  an  easily1- solved 
unconstrained  optimization  problem.  Given  a  cost  function 
j (W)  and  a  vector- valued  constraint  function  4>(W )«0  , 
the  problem 


minimize  J  (W) 
subject  to  $(W)  «  9 


(A.  1) 


is  changed  to 

minimize  J  (W)  +  p24>T  (W)$  (W)  .  (A. 2) 


As  the  scalar  (3-.*  the  solution  to  the  unconstrained  problem 
(A. 2)  goes  to  the  solution  of  the  problem  (A.l).  The  second 
problem  is  easily  solved  by  standard  unconstrained  optimi¬ 
zation  techniques. 
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The  specific  problem  considered  by  Griffiths  is  the 

problem  of  Example  3,  Section  VI,  where  J(W)  *WTRXXW  and 
T 

<|>(W)  =  C  W-y  .  The  algorithm  is  derived  by  forming  the 
function 

H  (W)  =  ~  WTRxxW  +  32  (CTW- 5  )T(CTW- y  )  ,  (A.3) 

and  taking  the  gradient  with  respect  to  W 

V  =  Rxxw  +  e2c(cTw-y  )  .  (A. 4) 

The  iteration  is  then 

W (k+1)  =  W(k)  - 

=  W(k)  -  ^R^WQO  -  n£2C(CTW  (k)  -  y  )  .  (A. 5) 

T 

is  replaced  by  its  estimate,  X(k)X  (k)  ,  giving 

W (k+1)  =  W(k)  -  HX(k)XT(k)W(k)  -  M-02C(CTW(k)  -  y  )  .  (A. 6) 

This  is  Griffiths'  MLR  algorithm. 

We  infer  from  this  derivation,  and  well-known  properties 
of  penalty  function  schemes  [3]  ,  [17]  that: 

i)  the  algorithm  has  an  error  correcting  property, 
i.e.,  it  will  not  wander  far  from  the  constraint 
in  the  sense  of  the  gradient  projection  algorithm 
discussed  in  Section  VII. 

ii)  However,  the  satisfaction  of  the  constraint  is 
"soft",  i.e.,  for  finite  values  of  0  the 
solution  of  (A. 2)  will  not  exactly  satisfy  the 
constraint. 


~ "I.1.1'  "I  ""W 


i.m  jii  1.19  i.j in  i.a  mmp 
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iii )  Increasing  8  to  cause  the  weight  vector  to 

more  nearly  satisfy  the  constraint  will  increase 
the  convergence  time  of  the  algorithm. 


APPENDIX  B 


STEADY- STATE  MISADJUSTMENT 

Moschner  [20]  calculated  the  mis  adjustment  for  the 
algorithm 

W  (k+1)  =  P[W  (k)  -  M-y  (k)X  (k)]  +  F  .  (B.l) 

By  his  method  precisely  the  same  results  for  misadjustment 
may  be  obtained  for  the  algorithm 

W (k+1)  *  P[W(k)  +  ne(k)X(k)]  +F  ,  (B.2) 

where  e(k)  *d(k)-y(k)  and  the  optimal  weight  vector  of 

(B.2)  is  defined  to  be  W*  of  Theorem  1. 

A  slight  improvement  in  the  bounds  obtained  by  Moschner 

is  possible  by  noting  in  his  equation  (D.19)  that  since 

B  *E(V  VT)  and  V  =  PV  by  Geometrical  Property  5  and 
n  n  n  n  n 

P2  =  P  ,  then 

Tr(PRBnR)  =  Tr(PRPBnR)  ,  (B.3) 

and  so 

°min  Tr(BnR)  ^  Tr  (PRBnR)  ^  °max  Tr(BnR)  '  (B'4) 

where  o  .  and  o  are  the  smallest  and  largest  non- 
min  max 

zero  eigenvalues  of  PRP  .  The  result  follows  by  using  the 
above  facts  in  Moschner' s  derivation. 


APPENDIX  C 


LEMMAS  ON  QUADRATIC  FORMS 

Lemma  C.l.  Let  R  be  an  (nxn)  positive- definite  matrix 

and  C  be  an  (nxm)  matrix  (with  n  >m)  having  full 

rank  m  .  Then  the  (nxn)  matrix  CTRC  is  positive 

T  - 1 

definite  and  (C  RC)  exists. 

Proof  of  Lemma  C.l. 

T 

Since  R  is  positive  definite  then  V  RV  >  0  for  any 
n- vector  V^0  .  We  want  to  show  for  any  ro- vector  U^e 
that  UTCTRCU>0  ,  hence,  CTRC  is  positive  definite  and 
its  inverse  exists. 

If  the  vector  V  jf  G  ,  it  has  rank  1.  By  Sylvester’s 
inequality  [9]  ,  the  rank  of  the  product  of  two  matrices  is 
not  less  than  the  sum  of  the  ranks  of  the  matrices,  less 
their  common  dimension.  Letting  p(«)  denote  rank,  the 
rank  of  the  n- vector  CU  is  bounded  by 

P(CU)  >  p  (C)  +  p  (U)  -  m 
_>  m+l-B 

>  1  .  (C.l) 

from  which  we  conclude  CU  is  not  the  zero  vector.  There¬ 
fore,  letting  V  *  CU  we  conclude 

UTCTRCU  =  VTRV  >  0  ,  (C.2) 
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Proof  of  Lemma  C.2. 

Since  PRP  is  a  symmetric  (nxn)  matrix,  it  has  n 
eigenvectors  and  n  eigenvalues.  The  eigenvectors  can  be 
chosen  to  be  orthogonal  [7]  . 

i)  Since  the  matrix  C  has  full  rank  it  has  m  columns 

of  linearly  independent  n- vectors.  Direct  calculation 
T 

shows  that  C  PRP*0  ,  so  the  m  columns  of  C  are 
eigenvectors  of  PRP  with  zero  eigenvalues, 
ii)  There  must  be  (n-m)  remaining  eigenvectors  ortho¬ 
gonal  to  the  columns  of  C  .  As  shown  in  Appendix  E, 
the  columns  of  C  are  vectors  normal  to  the  constraint 
plane  r  and  subspace  Z  .  Therefore,  the  remaining 
(n-m)  eigenvectors  must  be  in  Z  .  As  shown  in 
Geometrical  Property  5  of  Section  IV,  if  V  is  a 
vector  in  Z  ,  then  PV  ■  V  .  Therefore  if  an 
eigenvector  e^  of  PRP  is  in  Z  then 

eTpRPei  =  eTRei  >  0  .  (C.4) 

Lot  be  an  eigenvalue  corresponding  to  an  eigen¬ 

vector  of  PRP  in  Z  .  Then  by  definition 

PRPei  «  oiei  (C  .5 ) 

so 

eTpRPei  *  °ieiei  *  °i  •  (C.6) 
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From  (C.4)  and  (C.6)  it  follows  that 


>  0 


i*l#  2, . . . ,  (n-m)  . 


(C .  7 ) 


iii)  It  is  well  known  that  if  e  is  a  unit  vector 
T 

then  e  Re  is  bounded  by 


\»in  *  e  Re  ' 


where  A  .  and  A  are  respectively  the 
min  max 


(C .  7 ) 


largest  and  smallest  eigenvalues  of  R  .  There¬ 


fore  from  (C.4)  and  (C.6) 


^  °i  *  \ 


The  result  follows. 

This  completes  the  proof  of  Lemma  C.2, 


(C.8) 


APPENDIX  D 


EXPECTED  DEVIATION  FROM  THE  CONSTRAINT 
BY  THE  GRADIENT- PROJECTION  ALGORITHM 


As  an  approximation ,  quantization  in  the  weight  vector 
Is  modeled  as  an  additive  white  noise  process  (see  Widrow, 
[27]  );  the  expected  deviation  from  the  constraint  by  the 
gradient  projection  algorithm  is  computed  as  a  function  of 
time. 


Assume  that  a  fixed- point  representation  for  the  weights 
is  used;  let  the  quantization  size  of  a  single  weight 
be  q  .  Using  Widrow' s  value  for  the  error  variance,  q  /12  , 
from  (7.9)  the  expected  squared  Euclidean  distance  from  the 
constraint  at  time  k  is 


E{|lcTw(k)  -y  |!2} 


=  Tr  (C  ^  I  CT) 

Tr (CCT)  (D.l) 


Thus  the  expected  squared  distance  from  the  constraint 
increases  linearly  with  time  (approximately). 

For  the  special  case  of  the  array  problem,  with  C 
defined  by  Eq.  (6.27),  Tr(CCT)=n  ,  where  n  is  the  number 
of  tap  points.  Equation  (D.l)  becomes 

2 

E(j|CTW(k)  -  J  l|2}  =knfr  .  (D.2) 
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APPENDIX  E 


THE  METHOD  OF  LAGRANGE  MULTIPLIERS 


Consider  the  equality- constrained  optimization  problem 


minimize  J(W) 
subject  to  $>(W)  =  9 


(E.l) 


where  J  ( • )  is  a  scalar  cost  function  and  $(•)  is  a  vector- 

T  2 

valued  constraint  function.  In  Theorem  1,  J  (W)  *E{(d-W  X)  } 

T 

and  4>  (W)  *C  W-  5F  .  Let  the  gradient  of  the  function  J(W) 
with  respect  to  a  vector  W  evaluated  at  be  written 

V<V  where 


V<V 


(E.2) 


w«w.  . 


A  necessary  requirement  for  the  optimal  solution  of  (E.l)  to 
be  at  a  point  WQ  is  that  the  gradient  of  J  with  respect 


to  W  be  normal  to  the  constraint  surface  at  W„ 


If  the 


gradient  of  J  at  WQ  were  not  normal  to  the  constraint 
surface  then  by  sliding  along  the  constraint  a  vector  W1 
could  be  found  still  satisfying  the  constraint;  but  having 


^The  constraint  surface  is  understood  to  be  the  points 
satisfying  the  constraint  4>(W )=0  .. 


lower  cost,  i.e.,  JfW^)  <  J(WQ)  . 

Fleming  ( [8]  ,  p.  126)  shows  that  the  normal  vectors  to 
any  manifold  defined  by  4>(W)  is  vw#  .  For  example, 

ff 

the  gradient  of  the  constraint  defined  by  $(W)  *CTf-J  =  q  is 
T 

vw(C  w-y)  «C  ;  therefore,  each  of  the  m  columns  of  C 
is  a  vector  orthogonal  to  the  constraint  plane  and  any  linear 
combination  of  those  vectors  is  also  orthogonal  to  the  plane. 

Let  A  be  an  undetermined  m- vector  of  multipliers. 

The  vector  CA  is  a  linear  combination  of  the  columns  of 
C  and  so  is  normal  to  the  constraint  plane.  Thus  another 
way  to  express  the  necessary  condition  that  the  gradient  of 
the  cost  function  J  be  orthogonal  to  the  constraint  surface 
is  to  say  that  for  some  choice  of  A  the  gradient  and  the 
normal  may  be  anticollinear,  i.e.,  (see  Fig.  E.l) 

VWJ(Wo)  +cA  *  9  »  (8.3) 

or  more  generally 

V^o*  +  V*(Wo)*  *  9  •  (E*4) 

Another  way  of  writing  (E.4)  analogous  to  the  necessary 
condition  for  unconstrained  optimality  (v^J  (VtQ)  «0)  is  by 
defining  the  function  H(W)  by  "adjoining"  the  cost  function 
to  the  constraint  function  by  the  Lagrange  multipliers, 

H(W)  «  J(W)  +  AT<j>(W)  (E.5 ) 


and  requiring 


-  9  . 


(E.6) 
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Then  since  (WQ )  +  vw<!> (WQ ) X  ,  (E.6) 

is  identical  to  (E.4)  and  the  necessary  conditions  become 
(E.6)  and 

$(WQ)  *  0  .  (E. 7 ) 

For  an  excellent  discussion  of  the  Lagrange  multiplier 
method  in  more  general  applications  see  Bryson  and  Ho  [3] . 


< 


APPENDIX  F 


SIMULATION  OF  THE  DIRECT  SUBSTITUTION  ALGORITHM 

At  the  beginning  of  Section  III  the  direct  substitution 
algorithm  was  suggested:  To  obtain  an  estimate  of  the  opti¬ 
mal  weight  vector,  the  unknown  correlation  matrices  are 
estimated  and  inserted  directly  into  the  equation  for  the 
optimal  weight  vector.  Although  computationally  quite 
difficult  (because  of  the  number  of  matrix  inversions  and 
multiplications  involved)  the  direct  substitution  method 
offers  the  possibility  of  improved  performance. 

The  direct  substitution  algorithm  was  simulated  on 
the  array- processing  problem  of  Example  3  using  exactly  the 
same  data  as  the  Constrained- IMS  processor.  The  direct 
substitution  algorithm  is 

R^OO  -  aRxx(k-l)+  (1-a )X (k-l)XT (k-1)  (F.l) 

W(k)  =  S^(k)C[CT6^(k)C]"15  (F.2 ) 

where  0<a<l  .  Equation  (F.l)  is  an  exponentially-weighted 
estimate  of  the  true  correlation  matrix  R^  .  Equation  (F.2) 
is  the  equation  for  the  optimal  weight  vector  for  the  problem 
with  R^fk)  substituted  for  R^  .  The  constant  a  , 
which  controls  both  rate  of  convergence  and  misad justment, 
was  chosen  to  be  0.97,  a  value  which  experimentally  lead  to 
approximately  the  same  misad justment  as  the  Constrained- IMS 
processor  had  in  Example  3.  R^fO)  was  initialized  to  the 
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identity  matrix,  scaled  by  the  power  measured  on  each  tap; 
in  this  case  the  total  power  on  each  tap  was  2.2  (see 
Table  6.1),  so  1^(0)  was  2.21.  This  is  a  reasonable 
starting  point  since  the  power  on  a  tap  is  easily  measured 
in  a  real  situation  and  also  a  simple  calculation  shows  that 
if  6^(0)  is  any  diagonal  matrix  then  W  (0)  *  C  (CTC)“  =  F  . 
The  vector  F  vna  also  the  initial  weight  vector  of  the 
Constrained- IMS  algorithm  so  the  two  processors  essentially 
start  out  at  the  same  point  and  a  meaningful  comparison  is 
easily  obtained. 

Results  of  the  simulation  are  shown  in  Fig.  F.l. 

Compare  with  Fig.  6.6.  For  the  same  misadjustment,  the 
better  processor  should  have  a  faster  rate  of  convergence. 

A  careful  comparison  of  Fig.  F.l  and  6.6  fails  to  show 
conclusively  which  algorithm  has  the  better  performance. 

For  this  example  at  least,  the  user  would  have  been  just  as 
well  off  to  use  the  simpler  Constrained- IMS  processor. 

Readers  interested  in  the  direct  substitution  method 
should  consult  Saradis,  et  al.  [24]  and  Mantey  and  Griffiths 
[18]  ,  [19]  . 
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PART  II 

ADAPTIVE  ESTIMATION  IN  NONSTATIONARY  ENVIRONMENTS 

by 

James  Edvard  Brown,  III 
ABSTRACT 

In  the  classical  design  of  processors  for  sensor  arrays 
whose  purpose  is  signal  detection  and  estimation,  a  receiver 
is  optimized  on  the  basis  of  the  &  priori  knowledge  of  the 
statistics  of  its  input  signals.  However,  when  the  a  priori 
knowledge  is  not  available,  the  receiver's  performance  can  still 
be  improved  by  performing  measurements  on  its  input  signals 
and  incorporating  this  new  information  into  its  design.  Such 
receivers  are  called  adaptive. 

The  purpose  of  this  research  i3  to  develop  and  analyze 
a  gradient- descent  surface- searching  algorithm  for  automatically 
adjusting  (adapting)  the  parameters  of  a  linear  tapped-delay- 
line  array  processor  in  order  to  improve  its  performance  in 
an  unknown  chancing  environment.  The  tracking  ability  of  this 
algorithm  is  demonstrated  when  the  characteristics  of  the 
nonstationarity  are  such  that  the  optimum  parameter  sequence 
can  be  modeled  as  a  first-order  Markov  process  with  a  known 
transition  function.  A  worst-case  analysis  of  the  algorithm 
is  presented  for  three  types  of  nonstationarities  when  the 
above  model  for  the  nonstationarity  is  not  applicable. 

The  techniques  developed  in  analyzing  the  above  algorithm 
provide  a  powerful  approach  for  the  further  study  of  gradient- 
descent  algorithms  used  in  searching  unknown,  nonstationary 
surfaces.  Among  the  most  important  consequences  are: 

i)  the  removal  of  the  usual  assumption  that  the  data 
be  jointly  Gausrian; 

ii)  the  development  of  a  new  convergence  theorem  for  a 
dynamic  stochastic  approximation  algorithm,  thereby 
extending  a  branch  of  stochastic  approximation  theory 
to  the  analysis  of  adaptive  processors  in  nonstationary 
statistics; 

iii)  the  enlargement  of  the  class  of  problems  for  which 
stochastic  approximation  algorithms,  adaptive  esti¬ 
mation  algorithms,  and  the  Kalman- Bucy  theory  can 
be  compared. 

Also  presented  in  an  appendix  is  a  procedure  for  auto¬ 
matically  adjusting  the  convergence  factor.  Some  experimental 

results  are  presented. 
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I.  INTRODUCTION 


A.  PURPOSE 

In  the  classical  design  of  processors  for  sensor  arrays 
whose  purpose  is  signal  detection  and  estimation  [  1  ]-[  7 ],  a 
crucial  role  is  played  by  the  a_  priori  information  available 
at  the  receiver.  In  practice,  the  receiver's  performance  can 
be  improved  by  performing  measurements  on  its  input  signals 
and  incorporating  this  new  information  into  its  design  [ll]- 
[21].  Such  receivers  are  called  adaptive. 

The  purpose  of  this  research  is  to  develop  and  analyze 
a  procedure  for  automatically  adjusting  (adapting)  the 
processor  in  order  to  improve  its  performance  in  an  unknown 
chancing  environment. 

B.  THF.  PRORT.F.M 

The  type  of  array  processor  considered  in  this  paper  is 
the  multichannel  linear  discrete  time  processor  (also 
referred  to  as  the  tapped- delay- line  processor)  shown  in 
Fig.  1.1.  The  input  x  at  each  receiver  is  sampled  at 
regular  intervals  and  shifted  down  the  tapped  delay  line. 

The  sampled  value  at  each  tap  is  weighted,  and  all  the 
weighted  values  are  summed  to  form  an  output  y  which  will 
be  viewed  as  an  estimate  of  some  desired  quantity  d  .  For 
the  simple  case  when  x  consists  of  a  transmitted  signal  plus 
additive  noise,  the  desired  output  d  is  taken  to  be  the 
transmitted  signal.  In  general,  d  may  be  taken  to  be 


1.1.  General  form  of  tapped- delay- line 
multichannel  filter. 
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some  other  desired  output,  depending  on  the  purpose  of 
the  receiver. 

The  criterion  used  in  this  research  for  determining 
the  best  set  of  weight  values  w  for  the  above  system  is 
the  mean- squared  error  between  the  processor  output  y  and 
the  desired  output  d  [  1  ]  -  [  7  J  .  This  criterion  is  a 
common  one  used  in  the  design  of  array  processors  since  the 
pioneering  work  of  Wiener  [4  ] . 

The  mathematical  description  of  the  array  problem  is 
given  as  follows:  Let  [X^  :k«l,2,...)  be  a  sequence  of 
p- dimens ionzyl  vector- valued  random  variables  to  be  referred 
to  as  the  input  sequence .  The  components  of  X^  are  the 
inputs  xiK*x2k'  •  •  *'xpk  at  tlie  var*ous  taPs  of  the  Processor 
at  time  k  .  Let  {d^  :k  =  l,2,...}  be  a  corresponding 
sequence  of  real- valued  random  variables  to  be  referred  to 
as  the  desired- output  sequence .  The  pair  (d^,X^, )  will  be 
called  the  data  pair  at  time  k  .  Assume  that  the  sequence 
{ (dJc,Xk)  :k  =  l,2,...}  is  an  independent  sequence.  The 
correlation  matrix  at  time  k  =  n  ,  defined  by 


Rn  6  E(XnXn' 


(1.1) 


is  assumed  to  be  positive  definite  with  finite  eigenvalues. 
The  crosscorrelation  vector  at  time  k  =  n  is  defined  by 


The  expectation  operator  will  be  denoted  by  E [ - ]  . 

The  transpose  will  be  denoted  by  T  . 
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Pn“EtdnXnl  • 

Let  W  be  some  p- dimensional  column  vector,  referred 
to  as  the  weight  vector  or  discrete- time  filter,  whose  compo¬ 
nents  are  the  weights  w.,w  ,  ...,w 

—  —  P 

The  object  is  to  estimate  that  weight  vector  W* 
which  minimizes  the  mearv-  squared  error  at  time  n  , 
given  by 

Sn(W)  *  E[(dn- WTXn)2] 

-  E  [d^J  -  2WTPn  +  WTRnW  .  (1.3) 

It  is  a  well  known  result  [42]  that  W*  is  given  by 

wn  *  Rl\  ■  <l-«> 

This  vector,  W*  ,  will  be  called  the  "Wiener"  weight  vector 

or  the  optimum  finite- dimensional  linear  weight  vector  at 

time  n  .  The  corresponding  mean- squared  error  will  be 

denoted  by  .  (A  typical  mean- squared- error  surface  is 

shown  in  Fig.  1.2  when  the  weight  vector  has  only  two 

components.)  Note  that  the  expression  (1.4)  requires  that 

the  second  order  statistics,  R  and  P_  ,  be  known.  It 

n  n 

would  be  highly  desirable  to  have  a  design  procedure  which 
would  not  require  this  _a  priori  information  since  it  normally 
would  not  be  available  to  the  array  processor. 

A  method  for  determining  Rn  and  Pn  that  immediately 
comes  to  mind  is  to  compute  the  time- aver  ages  [14] 


1 

n 


(1.5) 


and 

For  the  stationary  problem  this  would  result  in  the  optimum 
finite-dimensional  linear  estimator  W*  in  the  limit  as 
more  and  more  measurements  become  available.  However,  if 
the  environment  in  which  the  receiver  operates  is  nonstationary, 
the  above  method  is  not  applicable  in  the  determination  of 

fc 

the  instantaneous  value  of  the  optimum  weight  vector  Wn  . 

The  time-averages  (1,  5)  progressively  weighs  the  new.  infor¬ 
mation  contained  in  the  data  pair  (d,X)  less  and  less  as 
time  progresses.  Meanwhile,  the  optimum  weight  vector 

continues  to  change.  Another  procedure  for  estimating 
* 

Wn  will  have  to  be  developed. 

C.  APPROACH 

The  approach  used  in  this  research  for  estimating  (or 

t  *  i 

tracking)  the  optimum  weight- vector  sequence  {WnJ  is  to 
extend  a  gradient- descent  surface- searching  algorithm  (the 
method  of  steepest  descent  [  8  3  -  [10]  )  to  the  tracking  of 
an  unknown  time- varying  surface.  The  resulting  system 
(array  processor  plus  adaptation  algorithm)  gains  the 
capability  of  responding  to  changes  in  the  input-data  sta¬ 
tistics.  This  results  in  an  adaptive  system  whose  performance 
is  vastly  superior  to  that  of  a  fixed  system  in  many  instances. 

Tne  analysis  of  the  adaptation  algorithm  is  divided  into 
two  parts .  In  Chapter  IV  asymptotic  bounds  for  the  performance 
of  an  adaptive  processor  are  obtained  when  its  input  is 
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nonstationary.  The  characteristics  of  the  nonstationarity 
are  such  that  the  optimum  weight- vector  sequence  fw*]  can 
be  modeled  as  a  first-order  Markov  process  with  a  known 
transition  function.  However,  in  many  applications  it 
is  unreasonable  to  expect  that  a  model  for  the  nonstationarity 
1A1  be  available.  For  this  reason  a  worst- case  analysis  of 
€nel  adaptive  system  is  presented  in  Chapter  V  for  three  types 
of  nonstationarities .  As  shown  by  the  example  given  in 
Chapter  VI,  these  results  are  particularly  informative  as 
to  the  type  of  behavior  to  expect  from  the  adaptive  system. 

D.  CONTRIBUTIONS 

The  principle  contributions  of  this  research  are: 

1)  A  gradient- descent  surface- searching  algorithm  is 
developed  for  adapting  the  parameters  of  a  linear 
tapped- delay- line  array  processor  in  an  unknown, 
time- varying  environment.  The  tracking  ability  of  this 
algorithm  is  demonstrated  when  the  nonstationarity  is 
modeled  by  the  optimum  weight- vector  sequence  {WnJ 
being  a  first-order  Markov  process  with  a  known  transition 
function.  A  worst- case  analysis  of  the  algorithm  is 
presented  for  three  types  of  nonstationarities  when  the 
above  model  for  the  nonstationarity  is  not  applicable. 

2)  The  techniques  developed  in  analyzing  the  above  algorithm 
provide  a  very  powerful  approach  for  the  further  study 

■jf  gradient- descent  algorithms  used  in  searc'  ur.kr  ,wn. 
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nonstationary  surfaces.  Among  the  most  important 
consequences  are: 

(i)  the  removal  of  the  usual  as  sumption  that  the 
data  pair  (d,X)  be  jointly  Gaussian. 

(ii)  the  development  of  a  new  convergence 

theorem  for  a  dynamic  stochastic  approxi¬ 
mation  algorithm,  thereby  extending  a 
branch  of  stochastic  approximation  theory 
to  the  analysis  of  adaptive  array  pro¬ 
cessors  in  nonstationary  statistics. 

(iii)  the  development  of  an  analytical  comparison 
between  the  adaptation  algorithm  and  the 
Kalman- Buey  recursive  filter.  This  result 
is  presented  in  Appendix  I. 


II.  TRADITIONAL  METHODS  FOR  DESIGNING  ADAPTIVE  PROCESSORS 

A  large  number  of  array  processors  which  adjust  their 
parameters  as  a  function  of  their  inputs  have  been  considered 
in  the  literature.  A  representative  sample  is  given  in  the 
references  [11]-  [52].  Two  basic  types  of  systems  have 
resulted  from  the  above  research:  a  parametric  system 
[11]-  [13]  and  a  non- parametric  system  [14]-  [52].  The 
parametric  system  is  characterized  by  the  assumption  of  an 
underlying  statistical  framework  for  the  input  data;  e.g., 
(d,X)  jointly  Gaussian,  or  the  waveform  of  d  known  with 
X  Gaussian,  etc.  This  system  is  inevitably  specialized  to 
specific  applications.  On  the  other  hand,  the  non- parametric 
system  is  less  structured  and  more  applicable  to  a  wider 
range  of  problems. 

The  work  directly  related  to  the  research  presented 
in  this  paper  is  in  the  area  of  the  non- parametric  design  of 
array  processors.  Within  this  classification  there  are  a 
number  of  approaches  to  the  design  problem.  The  most 
promising  statistical  procedures  are  those  which  can  keep 
pace  with  the  incoming  data  so  as  to  constantly  update  the 
receiver's  current  state  of  knowledge  about  its  environment 
[17  ]  -  [52  ] .  There  are  primarily  two  approaches  in  changing 
the  processor '  s  parameters  in  "real-time":  stochastic 
approximation  [17]-  [40]  and  adaptive  estimation  [41]-  [52]. 
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The  application  of  these  two  theories  to  processor 
design  in  minimum- mean-  squared  error  estimation  problems 
yields  a  differential  correction  algorithm  based  on  the 
method  of  steepest  descent  [8  ]  -  [  9].  The  form  of  the 
algorithm  is 


Wn+1  *  "n-  Vn  ' 


(2.1) 


where  Vf  is  an  estimate  of  WT  ,  H  is  the  convergence 
n  n  n 

th 

factor  for  the  n  iteration,  and  Zn  is  an  estimate  of 
Jn(Wn)  ,  the  gradient  of  the  mean- squared  error  surface 
Sn(W)  with  respect  to  W  evaluated  at  W  =  Wn  .  (Methods 
for  obtaining  the  estimate  ZR  will  be  discussed  later.) 

The  stochastic  approximation  version  of  the  algorithm 
(2.1)  is  characterized  by  the  sequence  (^n)  tending  to 
zero  in  some  prescribed  manner;  the  adaptive  estimation 
version  of  the  algorithm  (2.1)  is  characterized  by  the 
sequence  fl-Ln)  being  set  equal  to  some  prescribed  positive 
constant  M-  .  The  former  procedure  is  designed  to  estimate 
the  unknown  parameters  W  in  a  strong  probabilistic  sense 
(mean-square  and  almost  surely),  while  the  latter  is 
designed  to  allow  a  "tolerance"  in  the  estimates.  As  will 
be  shown  in  this  research,  by  allowing  convergence  in  a  weak 
sense,  the  class  of  nonstationary  problems  that  can  be  handled 
by  adaptive  estimation  theory  is  larger  than  those  handled 


by  stochastic  approximation  theory. 
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The  history  of  stochastic  approximation  theory  began 
with  its  introduction  in  1951  by  Robbins  and  Monro  [17]. 

The  results  directly  related  to  the  present  research  were 
obtained  by  Blum,  Gardner,  Dupac,  and  DeFiguerido.  In 
1954,  Blum  [18]  extended  the  Robbins  and  Monro  procedure 
to  the  estimation  of  a  multivariate  parameter.  This  permitted 
the  application  of  stochastic  approximation  theory  to  the 
analysis  of  the  gradient  descent  algorithms  in  optimization 
theory  [19]-  [40  ].  Gardner  [21  ]  demonstrated  the  appli¬ 
cability  of  this  approach  in  the  design  of  adaptive 
predictors . 

The  development  of  stochastic  approximation  algorithms 
for  estimating  a  time- varying  parameter  received  little 
attention  until  1965  when  Eupac  [22]  published  his  classic 
paper  on  dynamic  stochastic  approximation  methods. 

DeFiguerido  [24]  extended  this  work  to  the  estimation  of 
a  multivariable  parameter  evolving  in  a  nonlinear  fashion. 

The  development  of  the  IMS  adaptation  algorithm  was 
motivated  by  Widrow  in  considering  determinstic  gradient 
procedures  for  use  in  pattern  recognition  [41].  The 
IMS  algorithm  was  later  applied  to  adaptive  filtering  by 
Widrow  [42]  and  by  Widrow  et  al.  [43],  because,  in  part,  of 
its  conjectured  ability  to  track  nonstationarities.  Griffiths 
[44]  later  modified  the  algorithm  for  certain  array  appli¬ 
cations.  Senne  [46]  provided  an  exact  analysis  of  both 
Widrow' s  and  Griffiths'  algorithms  under  a  special  set  of 
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assumptions  (stationary,  jointly  Gaussian  statistics). 
However,  the  technique  used  in  Senne's  analysis  has  not  been 
shown  to  generalize  for  non- Gaussian,  time- varying  statistics 
Daniell  [50]-  [51]  has  demonstrated  an  approach  to  the 
problem  that  can  be  generalized.  It  is  this  approach  that 


will  be  extended  in  this  research. 


III.  THE  ADAPTATION  ALGORITHM 


< 


* 
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A.  THE  DERIVATION  OF  THE  ALGORITHM 

The  starting  point  for  the  derivation  of  the  algorithm 
to  be  considered  in  this  research  is  the  procedure  given  in 
Chapter  II  by 


Wn+1  =  Wn-  Vn  ' 


(3.1) 


where  WR  is  an  estimate  of  Wn  ,  M-n  is  the  convergence 

factor  for  the  n  iteration,  and  ZR  is  an  estimate 

of  Jn(wn)  •  the  gradient  of  the  mean-squared  error  surface 

£n(W)  with  respect  to  W  evaluated  at  W  =  V*n  .  (A  typical 

choice  for  Z  is  [  1 

n 


=  -2  (d  -  W  X  )X„  . 
n  n  n  n  n 


(3.2) 


However,  a  number  of  other  choices  have  been  considered  in 
the  literature  [42]-  [ 50]  -  For  a  further  discussion  on  methods 
of  obtaining  Zn  ,  the  reader  is  referred  to  Appendix  E, 

Section  B. ) 

The  important  thing  to  note  about  Zn  ,  for  the  moment, 

is  that  if  it  were  a  good  estimate  of  Jn(Wn)  •  then  one 

would  expect  wn+^  *  given  by  (2.1),  to  be  a  better  estimate 
_  * 

of  W  than  W„  .  Recall  that  a  function  changes  most 
n  n 

rapidly  in  the  direction  given  by  its  gradient.  Hence,  by 
moving  along  the  gradient,  one  is  moving  down  the 
quadratic  surface.  However,  in  the  nonstationary 
case,  one  would  rather  have  wn+^  as  a  good  estimate  of 
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W  ,  ,  since  the  next  data  vector  processed  by  the  receiver 
n+l. 

is  X  ,  .  w*  is  the  optimum  weight  vector  for  this  input, 
n+l  n+l 

Motivated  by  the  Kalman-  Bucy  theory  [56] »  the  following 
argument  is  presented  for  modifying  (2.1)  in  order  to  track 
the  sequence  (W*)  . 

Let  us  assume  that  the  input  nonstationarity  is  such  that 
the  optimum  weight- vector  sequence  (^n)  can  be  modeled  by  the 
discrete- time  system 


W 


n+l 


pn<V+un  ' 


(3.3) 


where  Fn(‘)  is  some  function,  not  necessarily  known,  and 
fUn)  is  a  zero- 
moment  given  by 


fUn)  is  a  zero- mean  random  process  with  finite  second 


E[l!un!!2]  =  <  oo  t 

According  to  this  model,  which  is  similar  in  form  to  the 

usual  linear  discrete- time  model  introduced  by  Kalman  [55], 

the  optimum  weight  vector  undergoes  a  first-order  Markov  random 

walk.  The  problem  for  the  adaptive  process  is  to  track  W*  . 

n 

The  problem  here  differs  from  the  Kalman  problem  in  that  here 
Fn ( - )  need  not  be  linear  and  the  random  process  [Un)  need 
not  be  an  independent  Gaussian  random  process.  (More  will  be 
said  about  this  model  in  the  next  section.) 

Since  [  U  }  is  zero-mean,  from  (3.3)  one  sees  that 
Fn  (W* )  ■Ls  an  unbiased  estimate  of  W*+1»  Hence,  if  a  weight 


. ' 


is  the  Euclidean  norm  defined  by  |fu(f  =  UTU  . 
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that  F  (W)  would  be  a  good  estimate  of  W*  ,  .  Therefore. 
11  n+i  ' 

a  logical  modification  of  the  algorithm  (2.1)  is 


wn+l  *  pn(wn-  Vn> 


(3.4) 


Unfortunately,  in  many  applications,  one  will  not  have 
_a  priori  knowledge  of  the  sequence  of  functions  (Fn)  or 
the  nonstationarity  cannot  be  modeled  by  (3.3).  It  may 
still  be  desirable  to  use  an  algorithm  of  the  form  (3.4). 

Let  (Gn)  be  a  sequence  of  functions  determined  by  the 
experimentor .  (Gn  could  be  an  estimate  of  Fn  ,  based  on 
physical  measurements,  for  example,  or  on  .a  priori  knowledge.) 
The  algorithm  (3.4)  becomes 

Vl  '  GnWn-  Vn>  •  <3-5> 


(Some  specific  algorithms  of  the  form  (3.5)  are  given  in 
Appendix  E.)  This  is  the  adaptation  algorithm  to  be  consi¬ 
dered  in  this  research. 

The  procedure  (3.5)  is  similar  to  the  algorithm  proposed 
by  DeFigueiredo  [24]  for  learning  the  unknown  mixture 
distribution  in  a  pattern  recognition  problem  in  which  the 
environment  is  allowed  to  evolve  in  time.  However,  his 
formulation  requires  exact  knowledge  of  the  nonstationarity, 
i.e.,  both  (Fn)  and  {UR}  are  assumed  known.  Chein  and  Fu  [23] 
also  considers  a  similar  problem  to  that  of  DeFigueiredo. 

Here  again  the  nonstationarity  must  be  known  exactly. 
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The  analysis  of  (3.5)  given  in  Chapter  IV  of  this 
research  requires  exact  knowledge  of  only  f F  )  and  (p^)  . 
The  worst- case  analysis  for  (3.5)  presented  in  Chapter  V 
removes  even  this  restriction. 
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B.  PRELIMINARIES 
1.  Assumptions 

All  vectors  in  this  paper  are  contained  in  the 
Euclidean  p-  space  Ep  . 

* 

It  is  easily  verified  that  the  gradient  J_  (W)  =R^(W-W  ) 

n  n  n 

of  the  mean- squared- error  surface  £fi(W)  satisfies  the  two 
conditions : 

Condition  (Cl).  For  all  W  e  Rp 

l|Jn(W)ll2  *max(n)||w~wn112  *  (3.6) 

and: 

Condition  ( C2 ) .  For  all  W  e  Rp 

(W_Wn)Tjn(W)  Amin(n)||w"  Wnl,Z  '  (37) 

where  _ (n)  and  A  (n)  are  the  minimum  and 
min  max 

maximum  eigenvalues  of  Rn  respectively. 

It  will  be  assumed  that  there  exist  positive  constants  A+ 

,  * 

and  A  such  that  for  all  n: 

Condition  (C3). 

°<Kz  ^  W"1  ^  <  “  •  (3-8) 

In  one-dimension  these  conditions  require  that  J  (W)  be 

bounded  between  the  lines  A  (w- W  )  and  A  (W- W  )  . 

w  n  n 

(See  Fig.  3.1.) 
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Fig.  3.  l.  Illustration  of  Conditions  Placed  on  J  (W) 
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We  assume  that  the  gradient  is  measured  in  such  a  way 
that  the  sequence  of  gradient  estimates  (Zn)  satisfies 
the  two  conditions: 


For  all  n 

Et2nlMn-wnl  =  J„<V  ' 


(3.9) 


Condition  (C5 ) »  There  exist  positive  constants  and 

o_  such  that  for  all  n 


E[l|Zn"Jn(Wn)l,2lWn'Wn)  °1  +  °2  l,Wn  "  Wn 11 2  *  (3*10) 


Condition  (C4)  is  the  requirement  that  the  gradient  measure¬ 
ment  be  conditionally  unbiased.  This  is  consistent  with 
the  desire  that  algorithm  (2.1)  be  based  on  the  method  of 
steepest  descent.  Condition  (C5 )  reflects  the  experience 
that  as  one  is  farther  from  the  optimum,  the  instantaneous 
mean-squared  error  becomes  a  noisier  estimate  of  the  expected 
mean- squared  error.  (For  a  zero- mean  Gaussian  random  varia¬ 
ble,  the  variance  in  its  second  moment  is  twice  the  second 
moment  squared  [42].)  Hence,  the  instantaneous  gradient 
estimate  should  be  expected  to  increase . 

Condition  (C5)  is  satisfied  easily  by  most  gradient 
estimates  (see  Appendix  E,  Section  C).  The  condition  (C4) 
on  the  other  hand  is  almost  never  satisfied  exactly.  In 
order  for  (C4)  to  hold,  one  essentially  has  to  require  that 
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the  data  pair  (dn,Xn)  be  conditionally  independent  of 
the  previous  data  pairs.  This  insures  that  (dn,Xn)  is 
independent  of  wn  •  However,  it  turns  out  that  for  a  large 
weight  system  (p  » 1 )  ,  condition  (C4)  is  a  reasonable 
assumption. 

The  model  used  in  Chapter  IV  for  the  nonstationarity 


is  that  given  by  (3.3), 


Wn+1  *  Fn<Wn>+Un  ' 


(3.3) 


where  fun)  is  a  zero-mean  random  process,  not  necessarily 

an  independent  Gaussian  random  process.  The  evolution 

transformation  F„  also  need  not  be  linear.  However,  it 

n 

will  be  assumed  that  Fn  satisfies  the  Lipschitz  condition 


lFn(W)~  Fn(V) 

llw-  vll2 


f  <  °°  , 

n  ' 


(3.11) 


where  the  supremum  is  over  all  weight  vectors  W  and  V  . 

This  condition  is  weaker  than  one  requiring  that  the  deriva¬ 
tive  of  Fn(W)  exist  and  be  bounded  for  all  W  .  Note  that 
the  condition  (3.11)  does  require  that  Fn  be  continuous 
in  W  . 

The  purpose  of  the  Lipschitz  condition  is  to  bound  the 

the  maximum  change  or  stretching  of  allowed  under  Fn  . 

If  two  vectors  W  and  V  are  close  together,  one  wants  the 

transformed  vectors  F  (W)  and  F_(V)  to  be  close  together. 

n  n 

Another  way  of  putting  it  is  if  two  vectors  are  close  together, 
the  effect  of  Fn  operating  on  them  should  be  similar. 
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» 


It  will  also  be  assumed  that  the  functions  Gn(") 
satisfy  the  Lipschitz  condition. 


sup 


|6„  (W)  -  Gn  (V) 

llw-vll2 


A  2  . 

■  9n  <  ” 


(3.12) 


where  the  supremum  is  over  all  weight  vectors  W  and  V  . 


2 .  Mathematical  Approach  to  the  Analysis  of  (3.5) 

The  straightforward  approach  to  the  analysis  of  the 
algorithm  (3.5)  would  be  to  develop  a  recursive  relation 
for  E[^nfwnH  *  the  expected  mean-squared  error,  and 
evaluate  this  expression.  However,  this  approach  suffers 
the  drawback  that  it  leads  to  the  setting  up  of  the  problem 
in  a  randomly  time- varying  metric  space.  It  is  worthwhile 
to  pause  a  moment  and  see  how  this  comes  about. 

Starting  from  (1.3)  and  using  (1.4),  it  can  be  shown 
[  ]  after  some  (easy)  algebra  that 

«n<V  -  ?n+  <Mn'Wn>TRn  <"„'<>  •  <3‘13> 

Note  that 

Sn<V-e£  -  ("n'VX 

is  the  excess  mean- squared  error  due  to  using  the  weight 

★ 

vector  Wn  rather  than  the  optimum  one  Wn  .  The  expected 
value  of  this  expression,  defined  by 

Cn  '  Et^n-VTVWn-Wn>]  ■ 


(3.14) 


22 


is  tht  e  .pected  excess  mean- squared  error .  It  is  a  measure 

of  the  cost  associated  with  not  having  the  necessary  a  priori 

*  2 

knowledge  to  compute  Wn  ..  This  quantity  Cn  provides  a 

measure  of  the  efficiency  of  the  adaptive  algorithm  (3.5). 

2 

A  large  excess  cost  Cn  would  indicate  that  the  algorithm 

is  not  tracking  the  sequence  {w*}  very  well,  while  a 

2 

small  excess  cost  Cn  would  indicate  that  the  algorithm  is 

2 

working  well.  A  recursive  relation  for  Cn  is  desirable. 
Unfortunately,  the  expression 


<Mn-VTRn  <W„-Wn> 


is  the  equivalent  to  defining  a  random  time- varying  norm  on 

the  weight- space  because  Rn  varies.  This  adds  a  further 

complication  to  the  problem  because  the  changes  in  both 
★ 

R  and  W  have  to  be  characterized  in  order  to  develop 
n  n 

a  recursive  relation  for  (3.14).  Note,  however,  that 
using  assumption  (3.8),  one  can  obtain  the  inequalities* 


Wn-  wn"2)  <  E[  (Wn-  W*)TRo  (wn-  Wn)]  <  *"E[||Wn-  w, 


r*' 

n 


n 


n' 


n  n 


* 


n 


n 


]  ,  (3.15) 


which  follow  from  the  trace  inequalities  [47]  given  by 


The  inequai ities  (3 . 15 )  follow  from  the  trace  inequalities  by 
noting 

<Wn-VVwn-wn>  =  trtRn  <V  Wn>  <)T)  • 

Make  the  identification  A  =  R  and  B=  (W_  -  W„ )  (W^  -  W  )T  . 

n  n  n  n  n 
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\nin<A,tr'B>  ^  tr'AB>  *  Xmax(A)trf3>  • 

where  A  and  B  are  two  symmetric,  positive- semidefinite 

matrices  of  the  same  order  and  A  .  (A)  and  A  (A)  are 

mm  max ' 

the  minimum  and  maximum  eigenvalues  of  A  ,  respectively. 
Expressing  bounds  on  the  excess  mean- squared  error  in  terms 
of 

=  E[llwn“  W*||2]  (3.16) 

avoids  the  randomly  varying  metric  problem.  For  this  reason, 
the  expression  (3.16)  will  be  considered  in  this  research. 

Referring  back  to  the  algorithm  (3.5),  one  sees  that 
the  sequence  of  random  weight  vectors  (Wn)  depends  on  the 
choice  of  the  sequence  This  sequence  fM^}  controls 

the  stability  and  rate  of  convergence  of  the  algorithm  (3.5). 
In  the  following  chapters  the  asymptotic  properties  of  the 
sequence  {bnJ  are  investigated  as  a  function  of  the  choice 
of  (Hn)  . 


IV.  CONVERGENCE  PROPERTIES  OF  THE  ADAPTATION  ALGORITHM 


In  this  chapter,  asymptotic  properties  of  the  sequence 
fb2  =  E  [  I  -  W*  ||]  2)  are  investigated  as  a  function  of  the 
convergence  factors  (hn)  for  the  case  in  which  the  environ¬ 
mental  functions  (Fn)  are  known.  The  corresponding 
adaptation  algorithm  to  be  used  is  (3.5)  with  Gn  = Fn  • 
given  by 


W 


n+1 


=  F  (W  -  n  Y  ) 
n  n  n  n 


(3.5) 


The  first  three  results  to  be  established  below  demonstrate 
the  tracking  ability  of  the  constant-u  algorithm,  while  the 
last  result  provides  sufficient  conditions  for  the  appli¬ 
cation  of  the  corresponding  stochastic  approximation  algorithm. 


A.  CONVERGENCE  PROPERTIES  OF  THE  CONSTANT- U  ALGORITHM 

The  following  theorem  and  its  proof  provide  many  key 

results  used  in  obtaining  b&vyids  on  the  sequence 

(b2  =  E [  W  -W*!'2])  in  the  subsequent  discussion.  The 
nun 

theorem  is:  \ 

Theorem  4.1.  Assume  that  the  optimum  weight  sequence  (w*) 
is  generated  according  to  (3.3), 


W 


n+1 


Fn<V+Un 


(3.3) 


Assume 

E[Un]  =  0 

and 


-24- 
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p2  =  lim  sup  E[||Un||2]  <  co 

n->oo  1 


Let  the  adaptive  processor  be  described  by  (3.5)  with 

G  =  F  , 
n  n 


Wnxl  =  Fn  (W  -  M-Y  ) 
n+i  n  n  n 


(3.5) 


Assume  that  the  sequence  of  functions  (Fn)  satisfy 
the  Lipschitz  condition  (3.12)  with 


lim  sup  f  =  f  <  1  . 
n 

n-*» 

Assume  that  the  sequence  of  gradient  estimates  (Zn) 
satisfy  conditions  (3.9)  and  (3.10),  which  are 


E(zn|wn,w;]  =Jn(Wn) 


(3.9) 


EfHZn"  Jn{Wn,,|;ilWn'W^]  °1  +  °2 !,Wn "  Wn 11  ^  *  (3*10) 


Define 


b  l  =  E[||Wn- W*||2] 


(3.16) 


Then,  if 


0  <  H  < 


.  „2  ' 


+  at 


one  can  conclude 


lim  sup  bn  < 

n-*oo 


+  p 


n  1-  f[l-  2^.  +n2(A*2  +  o2)]}s 


(4.1) 
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Remark :  Note  that  this  bound  applies  even  for  the  case  in 
which  the  stochastic  driving  sequence  fu  )  is  correlated 
in  time.  Moreover,  the  sequence  fun)  may  be  correlated 
with  either  the  optimum  weight  sequence  (W*)  or  the  adaptive 
weight  sequence  (Wn)  or  both.  The  only  requirement  is 
that  (Un)  be  zero-mean  and  asymptotically  bounded  in 
expected  norm- squared .  An  example  of  this  type  of  environment 
is  one  that  can  be  modeled  as  a  finite- state  Markov  dependent 
switching  environment. 

The  general  form  of  the  bound  (4.1)  is  shown  in 
Fig  .  4.1  for  the  two  cases  f  =  1  and  f  <  1  , 
respectively.  Note  that  in  both  cases,  the  convergence 
factor  P-  which  minimizes  the  lim  sup  of  (bn)  is 
different  from  zero.  This  is  due  to  the  unknown  component, 
fUn)  ,  of  the  nonstationarity .  More  will  be  said  about  this 
effect  later  in  discussing  Corollary  4.1.1  and  4.1.2. 

However,  despite  the  general  applicability  of  the 
bound,  the  real  importance  of  this  theorem  lies  in  its 
proof.  The  methodology  used  here  demonstrates  the  power  of 
the  formulation  developed  in  Chapter  III. 

Proof  of  Theorem  4.1. 

Subtract  W*+1  ,  as  given  by  (3.3) from  both  sides  of 
(3.5)  to  obtain 

Wn+1  “  Wn+1  *  VWn-|JV-VMn)-un  *  <«•*> 

Using  Minkowski's  inequality  (6lJ #  which  states  that 


Bound 
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(E[(X+Y)2]}  <  { E  [X2]  }  +  fE[Y2]}  ,  (4.3) 

one  concludes  that  based  on  (4.2) 

bn+l^  (EtHPn{Wn- -  Fn(W*)  II2)  3)s+  P„  .  (4.4) 

where 

bn  *  E(H»n-<l|2l  • 

and 

Pn  =  EfHun  H2J  • 

The  evaluation  of  (4.4)  proceeds  as  follows:  By  the  Lipschitz 
condition  on  Fn  ,  given  by  (3.12),  it  follows  that 

|,Fn'Wn  -  11 V  -  Fn'Wn)l!^  fX  ‘  Wn  ‘  MZn«2 

*  £n(|,Wn  -  Wn«2  ’  2^<Wn  ‘  V  V  b2  »Zn“/  •  (4’5) 

By  (3.9)  and  (3.7)  it  follows  that 

E  [(wn  -  Wn>\]*  E[<Wn  "  Wn>T  ^ 

’  E[(wn  -  «;>T  jn(wnu 
2 

>  K.£[j|wn  -  w*ll  ]  - 

by  (3.9),  (3.10)  and  (3.6)  it  follows  that 

e[fe„|12]  =  -  J„(Wn)  +  Jn(Mn)ll2] 

-  Ef'Zn  -  W1'2]  +  E[»  Jn  <Wn> l|2] 

+  2E[Jn(wn)E[zn  "  Jn  <  V  'Vwn  >] 

.£  o2  +  (a2  +  X>2)E|j|Wn  -  W;||2j  .  (4.7) 


Putting  Eqs.  (4.5),  (4.6) ,  and  (4 . 7) ,  together,  one  has  the  result 


liZn>  -  Pn<Wn>112]^  fl\^'  2^< 


2  tr?-  J.  >  *2  \  I  -k2 


+  U‘ (o‘ + 


]b" 


*  ‘‘‘‘A 


(4.8) 


Once  more  using  the  Minkowski  inequality,  Eq.  (4.3),  conclude  that : 


(E[l|Fn(Mn-  '  F„K»2])^  2liX 


2(2  ,*2 
*  +  ^  (°2  +  A 


f  '■ 


+  ui^o  . 


From  the  above  and  (4.4),  the  key  recursive  relation, 

bn+l  fn  [b -  21iX*  +  1x2  (°2  +  X*2)]  ^  bn  +  ''  £n°l  +  P„  >  <4 ' 9  > 


follows.  Iterating  (4.9)  backward  to  k  =  N  yields 


TT  ^  bN  +  X  TT  ai 

k=N  KJ  N  *“NL  j=k+l  3J  i 


(4.10) 


where 


ak  - 


2M-A*  +  \±‘ 


(X*2  +  °2)]'i  ■ 


0k  =  +  pk  . 

and 

kil  =  1  • 

By  definition  of  limit  superior  [58]  •  one  ^as  the  result  that 
for  any  s>0,  there  exists  an  Nt.  such  that  for  all  n>N_ 


fn  <  £  +  e  ’ 


and 
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Pn  <  P  +  E  • 


Pick  the  e  such  that  for 


any 


2\ 

0  <  "  <  7^ - 2 

*  +  or 


it  is  also  the  case  that 


max 


0  2  L _ i_\ 

<  V-  < 

2\ 

-  max 

°'t{ 

1  1  ^ 

'  \K  i+V 

** 2 ♦ 0? 

1_  f+ej 

_  - 

2 

- 

where 


max[x,y]  = 


x  x  >  y 

y  y  >  x 


This  guarantees  that 

(f+s)£l-  2V\  +  42^*2  +  o^j  ^ 
Hence,  for  any  n  _>  N  ,  one  has  from  (4.10) 

L# 


V  /  n,n+l”N=-  V, 

bn+l  <  a  6  b 


n 


=  (X 


n+1- 


Nsfb  .  _&.T 

IX  ^'J  !-'>  ' 


where 


and 


a  =  (f  +  s)£l-  +  it2^A*2  +  o2^j 


3  =  ( f  +  s )  M-cr^  +  p  +  e  . 


Since  i<l  ,  for  all  6>0  ,  there  exists  an  M6  >  Ne  such 
that  for  all  n>Mr 


0- 

n-  Nr 

a 

Vi  -  — - 

0 

1-a 

-J 

<  6 
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« 


Therefore,  for  all  M&  , 


bn+l  <  1^  +  6  < 


M'Oj.f  +  p 

-  f  1-  2M-\  +  |i2^X*2  +  o*) 


+  7 


where  7  >  0  can  be  made  arbitrarily  small  by  choosing  a 
sufficiently  small  e  and  6  .  From  this  it  follows  by 
definition  of  limit  superior 


lim  sup  bn  £ 

n-*oo 


P-o^f  +  p 


1-  f  1-  2M-\  +  ^2^*2  +  o2^  ** 


(4.1) 


This  completes  the  proof  of  Theorem  4.1. 


Two  important  special  cases  of  the  problem  handled  by 
the  previous  theorem  are  when: 

(i)  the  nature  of  the  nonstationarity  becomes 
determinsistic  in  time  (the  random  driving 
process  (Un)  goes  to  zero  in  expected 
norm- squared;  i.e.,  p  =  0).  Examples  of  this 
case  are: 

a)  stationary  statistics.  Here  Fn(W)=W  and 

U  =0  for  all  time.  This  is  the  customarily 
n 

treated  problem  in  adaptive  system  theory  [42]~[47l* 

b)  asymptotically  stationary  statistics.  Noise 
sources  may  be  initially  present  that  eventually 
move  out  of  range  of  the  receiver. 

c)  known  varying  channel.  Measurements  can  be 
performed  on  the  channel  so  as  to  determine 
its  effect  on  the  input  statistics  to  the 
receiver. 
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d)  known  constraints  placed  on  the  weight  vector. 
It  may  be  desirable,  for  example,  to  control 
the  frequency  response  of  the  array  processor 
in  a  given  direction  while  nulling  out 
signals  coming  from  other  directions  (see 
[40  ]  or(48  1  or  Appendix  E,  Section  B)  . 

(ii)  the  nature  of  the  nonstationarity  is  strictly 

first-order  Markoff  (the  random  driving  process 
f u  )  is  a  zero-mean,  independent  random 
process).  An  example  of  this  problem  is  where 
the  input  data  x  is  the  output  of  a  linear 
randomly- time- varying  channel  with  additive 
white  noise  [53]-  [54].  The  object  of  the 
filter  is  to  predict  the  next  input  xn+1  on 
the  bases  of  the  previous  p  inputs. 

The  theorems  are : 


Corollary  4.1.1.  Under  the  hypothesis  of  Theorem  4.1,  if 
p  =  0  ,  then 


,  2 
lxm  sup  bn  £ 

n-*ro 


H2o2f2 


1-  f2  1  -  +  l^2  (**2  +  o2J 


(4.11) 


and 


Corollary  4.1.2.  Under  the  assumptions  of  Theorem  4.1,  if 
f U  }  in  (3.3 )  is  a  zero-mean,  independent  random 
process,  then 


lim  sup  b* 
n-*oc 


H2o2f2  +  p2 


l-£2[l-2^.  +  42(x*2  +  o2) 


(4.16) 
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Remark  1;  The  bounds  (4.11)  and  (4.16)  are  shown  in  Figs. 

4.2  and  4.3,  respectively.  While  the  form  of  the  bounds 
(4.1)  and  (4.16)  are  similar  for  a  given  set  of 
parameters,  the  bound  (4.16)  is  tighter  than  that  of  (4.1). 


Remark  2;  One  should  note  that  by  choosing  a  sufficiently 
small  ,  the  bound  (4.11)  can  be  made  arbitrarily 
close  to  zero,  i.e.. 


2 

lim  lim  sup  b‘  *  0  . 

U-*0+  n-*oo  n 

An  algorithm  with  this  property  is  said  to  be  e-optimal 

fel  ]. 


Remark  3  ;  For  the  stationary  statistics  problem  in  which 
Fn(W)  =  W  and  Un  =  0  ,  the  bound  given  by  (4.11) 
reduces  to 


lim  sup  b„  < 
_  n  -* 

n-»» 


2A*  -  U 


(^) 


This  is  the  result  given  by  Daniell  [50  ]. 


Proof  of  Corollary  4.1.1. 

From  the  proof  of  Theorem  4.1  (see  (4.4)  and  (4.8)), 
it  follows  that 


+  V-  (A 


By  Theorem  4.1,  lim  sup  bn  is  finite  for  0  <  M- < 

n-»°o 


pn  '  (4.12) 
2\ 


Bound 
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Using  the  result,  proven  in  Appendix  A,  that 


(lim  sup  b  V 
n-*°°  / 


1 2  2 
sup  b  \  =  lim  sup  bfi 

►oo  /  n— 


it  then  follows  from  (4.12)  that  for  any  s>0  ,  there  exists 

an  N  such  that  for  n  >  N  , 

e  —  s 


*2  ✓  f2| 

bn+l  ^  £ 


fl-  2H\  +  )i2  (**2  +  o2  )Jb2  +  |i2£2o2  +  e  .  (4.13) 


Noting  the  similarity,  of  (4.9)  and  (4.13),  one  has  by  analogy 
to  (4.1) 

lim  sup  bl  <,  - yT -  2 — 15 - TT“ 

n-»«>  n  1-  [l-  +  H^(\*^  +  o‘  )| 

This  completes  the  proof  of  Corollary  4.1.1. 


Proof  of  Corollary  4.1.2. 

Square  both  sides  of  (4.2)  and  take  the  expectation 
to  obtain 

E  l  ''wn+l  *  Wn+l"2*  =  E(||Pn(Mn-WZn)-F„(w;),|2l+Et!lun|l2) 

-  2E['Fn(Wn"  Wn))Tun'-  2E ' fFn <wn> ,Tun> 

By  the  independence  assumption  on  the  sequence  {Un}  »  ^ 
follows  that 

F([Fn(wn-“zn))Tun1  -  tErFn<w„-^n),jTE[un) 


(4, 


17) 


0 


9 


37 


and 


E[tPn(w;nTuni 


fE[Fn(Wn)J  ]TEfUnJ 
0  . 


Recall  (4.8)  from  the  proof  of  Theorem  4.1, 


E[l|Fn(Wn“^Yn)“Fn(Wn),,2]  ^ 


Therefore, 


bn+l  fn^“  tiA*  +  ^2^*2  +  °2))bn  +  ^2°ifJ+Pn  • 

Comparing  (4.18)  with  (4.9)  in  the  proof  of  Theorem  4. 
can  argue  by  analogy  that 


2 

lim  sup  b  <, 
n-»°°  n  “ 


H2o2f2  +  p2 


(l  -  2[i\  +  M2  (A*2  +  a2)] 


1-  f4 


(4.8) 


(4.18) 
1,  one 

(4.16) 


This  completes  t'xe  proof  of  Corollary  4.1.2. 
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B.  CONVERGENCE  PROPERTIES  OF  THE  STOCHASTIC  APPROXIMATION 
ALGORITHM 

For  a  subclass  of  the  problems  considered  in  the  previous 
section,  the  sequence  of  random  vectors  fWn)  generated  by 
(3.5)  converges  in  a  strong  probabilistic  sense  (e.g.,  mearv- 
square  convergence  and  probability  one  convergence)  to  the 
sequence  (w*)  .  The  following  theorem  provides  a  set  of 
sufficient  conditions  for  convergence: 


Theorem  4.2.  Assume  the  optimum  weights  are  generated  by 
(3.3)  where  fun)  is  a  zero-mean  independent  random 
process  with  E[||Un|l  ]  *  p*  where 

p2  <  00  . 

n 

Let  the  adaptive  processor  be  given  by  (3.5)  with 

G  =  F  .  Assume  (F  )  satisfy  (3.12).  If  (u  }  is 
n  n  n  n 

a  sequence  of  non^negative  real  numbers  satisfying 


k=l 


p  2  <  00 

K 


and  if 


2 

(i)  for  sufficiently  large  n,  f*(l->-  1-Ln^*)  1 


(ii)  £  (!-  fk(1"  likA*))  =  00  ' 


k=l 


2 

lim  E[  |IWn-  W*!|  ]  =  0 

n-»°o 


( 


» 


then 
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and 


★  2 

P[lim  ||W  -  W  ||  =  0]  =  1  . 

n-»« 


Proof  of  Theorem  4.2. 

Since  convergence  in  mean- square  will  be  needed  to  show 
convergence  with  probability  one,  this  aspect  will  be 
considered  first. 

Pick  such  that  for  all  n  ^  , 

2*. 


2 

*  +  °2 


Then  for  n  2.  ,  one  can  obtain  the  equivalent  expression 

to (4. 18)  in  the  proof  of  Corollary  4.1.2, 


where 

and 


b*  <  a  b2  +  £ 
n+1  n  n  Kn  ' 


a 


-  £n[1-2%^  +  ,1n(**2+  °j)] 


P_  - 


n 


, .  2  ,2  2  A  2 
M.  f  o,  +  p 
n  n  1  ^n 


(4.19) 


(4.20) 

(4.21) 


Therefore,  by  recursive  substitution,  one  has  for  n>N^ 


a. 


+ 


(4.22) 


An  especially  straightforward  proof  of  convergence  of 
(4.22)  is  based  on  Kronecker's  lemma,  proven  in  Appendix  B, 
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Lemma  4.2.1  ^Kronecker 1  s 


Let  ^'Xk'  a  sequence  of 


real  numbers.  Let  (a^)  be  a  sequence  of  positive 

numbers  converging  monotonically  upward  to  infinity. 

If  X  aT  5  sn  converges  to  some  finite  number, 
K=1  k 

say  s,  then 


f  t 


T  L  *k  =  0  • 

n-*®  n  K=1 


Returning  to  (4.22).  assume,  without  a  loss  of  generality, 
that  =  1  .  Make  the  identification  with  Lemma  4.2.1, 


that 


Therefore,  if 


aT  -  jK 

k  3=1  J 


“  akek 


TTV* 


0  as  n-*= 


(4.23) 


and  if 


(4.24) 


then  lim  b„  =  0  . 
n-*«> 


To  show  (4.24),  note  that  f£  is  bounded. 


Therefore,  it 


follows  that 


,£[  »2A°\  *  #  < 


oo 
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if 


To  show  (4.23),  use  the  inequality,  e“x  2  1-x 
to  obtain 


ft  a  ^  exp 
k=l  K 


for  all  x  , 


Therefore  it  suffices  to  show  that  there  exists  an  N2  such 
that  for  all  n^  N2  ,  an  1  ,  and 


n 

T  U-av)  — »  oo  as  n~»«> 

k=l  K 

Now 

■  1-^[1-a‘kx.  +  "k(x*2  +  ^)]  • 

Since  ^-*0  ,  one  can  assume  without  loss  of  generality  that 
for  all  k 

ak^  • 

By  hypothesis  of  theorem,  it  immediately  follows 

lim  E[||wn-  W* || 2  ]  =  0  . 

n-** 

To  show  convergence  with  probability  one,  take  condi¬ 
tional  expectations  to  obtain,  in  an  analogous  fashion  to  (4.19) 

Et“Wn+l-Mn+lH2!  'V<J  ^  an“Wn-<»2  +  en 


9 
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where  an  and  3n  are  defined  by  (4.20)  and  (4.21). 
Therefore 


E'llwn+rVi!|2H!V<iiJ)  -  E[E»Vr<+l»2lMn'<l  I«vwn»2] 


^anllwn-w;»2*3, 


(4.25) 


Note  that  if  a  <  1  and  3=0,  then  one  could  use  the 
n  n 

martingale  convergence  theorem  to  prove  convergence  of 
(4.25).  However,  3n  ^  0  .  The  following  lemma,  proven  in 
Appendix  C,  provides  the  necessary  result  to  show  convergence. 


Let  {x^}  be  a  random  process  such  that 

(i)  sup  E[  |X.  |]  <» 

k  K 

(ii)  E[Xk+1Jxk . Xx]  >  Xk- ak  for  all  k  . 

where  (a^)  is  a  sequence  of  non- negative  real  numbers 
such  that 


Z  a.  <  ^ 


Then  with  probability  one. 


lim  X  =  X  where  E[  (X  |  ]  < 
n  °o  °° 


Make  the  identification  with  Lemma  4.2.2# 


and 


-*n  -  "Wn-Wnl|2 


3  =  a 

n  n 


By  the  first  part  of  this  proof,  it  has  been  shown  that 


If  "U  I'll 
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a  <  1 

n 


n 


and 


r  P  <«, 

K-l  n 


lim  b*  »  0 
n 

n-^oo 


Hence,  S^p E  [  Hwn “  Wn^]  ^  °°  *  Therefore,  by  Lemma  4.2.2, 
one  has  with  probability  one 


Where 


Hm  ||wn-  W*  [! 2  -  . 

n-»oo 


E  [  I  x  I  ]  <®  . 


To  show  x  =0  ,  use  Fatou's  lemma  [61],  which  states  that 


E  lim  Xn  ^  lim  E[xn]  , 

n-»oo  n-»oo 

w 


to  obtain  the  chain  of  inequalities, 
0 


1  E  Tlirn  l!wn-W*!|2]  ^  lim  E  [llwn-  W*||2]  = 
Ln-»<»  J  n->oo  l  J 

Thus,  lim  l!wn-W*||2  =  0  a.e. 

n-»oo 

This  completes  the  proof  of  Theorem  4.2. 


0  . 


Remark:  With  Fn(w)=W  and  Un  =  0  for  all  n  ,  one  has 
algorithm  (3.5)  in  the  stationary  statistics  case.  By  the 
previous  theorem,  if 


* 

then  Wn  -»W  in  mean- square  and  with  probability  one  (see 

Appendix  E).  These  are  the  usual  conditions  required  in 

the  stochastic  approximation  literature  [17]-  [39].  However, 

one  important  difference  here  is  the  additional  variance 
A  o 

term,  °2  wn-wn  •  9i-ven  by  (3.10).  This  term  prevents  the 
application  of  Dvoretzky's  theorem  [35]  to  prove  convergence. 


V.  A  WORST- CASE  ANALYSIS 


Implicit  in  the  analysis  presented  in  the  previous 
chapter  is  that  the  nonstationarity  is  known  and  can  be 
modeled  by  the  discrete-time  system  (3.3).  If  the  nonsta¬ 
tionarity  is  unknown  or  cannot  be  modeled  by  (3.3),  the 
results  given  in  that  chapter  do  not  apply.  In  this 
chapter,  bounds  are  obtained  for  the  asymptotic  behavior 
of  the  adaptive  system  (3.5)  under  mild  restrictions  on  the 
optimum  weight  vector  sequence  fw*)  .  It  should  be  empha¬ 
sized  that  the  results  given  here  are  not  limited  to  the 
nonstationarity  model  (3.3). 

The  three  classes  of  nonstationarities  considered  are 
the  bounded- increment,  bounded- variation,  and  bounded- 
optimum.  The  bounded  increment  class  is  defined  to  consist 
of  those  sequences  (WnJ  for  which 

lim  sup  E[||W*+1  -  W*|I21  =  A2  <  «  ,  (5.1) 

n-*« 

the  bounded  variation  class  is  defined  to  consist  of  those 
sequences  (Wn)  for  which 

lira  £  (E[||W*+1  -  W*|f2]  <  «  ,  (5.2) 

n-*co  k=l 


and  the  bounded  optimum  class  is  defined  to  consist  of 
those  sequences  (W* }  for  which 


lim  sup  E[|lw*  -  Wq|!2]  =  B2  < 
n-*» 


9 


★ 

for  some  constant  weight  vector  Wq  . 


00 


(5.3) 
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The  emphasis  of  the  analysis  is  for  the  algorithm  (3.5)  » 

with  Gn(W)=W;  i.e.,  the  original  algorithm  (3.1).  However, 
the  theorems  stated  have  analogies  to  the  more  general 
algorithm  (3.5).  (The  extension  to  this  case  is  straight¬ 
forward.)  The  three  major  results  given  are  by  no  means 
exhaustive  of  the  possible  situations  that  can  be  encountered. 
Nevertheless,  they  do  illustrate  some  basic  approaches  and 
philosophy  for  the  handling  of  the  nonstationary  statistics 
problem. 

V 

The  analysis  presented  here  is  a  worst-  case  analysis 
for  (3.1)  in  the  following  sense.  Pick  the  convergence 
factor  sequence  {M^}  (M^  could  equal  M-  for  all  n)  . 

For  each  of  the  three  previously  mentioned  nonstationarities  • 

2 

the  corresponding  lim  sup  bn  is  the  asymptotic  bound  on 

n~*°°  *  2 

the  supremam  of  El''wn  ”  W  In]  over  all  possible  input  data 
pair  sequences  ( (d^» 

(3.6)-  (3.10)  for  a  specific  choice  of  \  ,  and  o 2  . 

(In  general,  a  different  sequence  (M-n)  will  result  in  a 

different  choice  of  the  sequence  (  (d^f X^) )  .  However,  the 
2 

bound  lim  sup  bn  may  remain  unchanged.)  In  other  words, 
n— *=° 

within  a  given  set  of  constraints  (conditions  (3.6)-  (3.10)  , 

(n  >  ,  and  the  class  of  nonstationarity ) ,  what  is  the  most 
manevolent  thing  that  nature  can  do  to  the  behavior  of  the 
algorithm  (3.1)  in  terms  of  the  sequence  (bn)  ? 


X^)}  that  satisfy  the  conditions 
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A.  NONSTATIONARITY  OF  THE  BOUNDED- INCREMENT  CLASS 


A  surprising  amount  of  information  about  the  performance 
of  the  adaptive  filter  can  be  inferred  from  knowing  only  that 
the  expected  change  in  the  norm  of  the  optimum  weight  is 
bounded.  This  is  sufficient  to  conclude  that  for  suitable 
choices  of  the  gain  parameter  M-  ,  the  optimum  weight  vector 
can  be  tracked  within  some  finite  distance.  The  result  is 
summarized  by 


Theorem  5.1.  For  the 
Chapter  III  with 


adaptive  filter  system  described  in 
G^W)  =W  ,  if 


lim  sup  E[||W*+1-  W* I! 2 ]  =  a2  <<oo  , 
n-»°o 


and 


0  < 


M-  < 


9 


then 

lim  sup  b  < 
_  n 

n-*°° 


A  + 

2M\+  V? 


2 


+  °2> 


(5.4) 


Proof  of  Theorem  5.1. 

* 

Starting  from  (3.5)#  with  Gn(W)=W  ,  subtract  Wn 
from  both  sides  to  obtain 

wn+l '  Wn+1  "  wn  -  <  ‘  '  <"*+l  -  <>  • 

Using  the  Minkowski  inequality.  (4.3),  one  concludes 

Vl  *  yE  t  llwn  -  W*  -  U2nll2]'  +  ^ 


(5.5) 
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where 


b;  = 


E[llwn  -  W*||2I 


Et»<+1  -  wn»2l 


Comparing  (5.5)  to  (4.4)  in  the  proof  of  Theorem  4.1,  one 
has  the  conclusion  (5.4)  by  direct  analogy  to  (4.1)  with 
f  *  1  . 

This  completes  the  proof  of  Theorem  5.1. 

It  is  interesting  to  note  that  by  using  the  conclusion 
of  this  theorem,  one  can  obtain  a  tighter  bound  by  returning 
to  its  proof.  This  observation  is  summarized  by 

Corollary  5.1.1.  Under  the  hypothesis  of  Theorehi  5.1, 


lim  sup  bn  <  -4-3  +  +  P  "  f 

n-*oo  l-a  y  '  l-a  /  1-a 


(5.6) 


where 


and 


a2  =  1-  2M\  +  H2  (o2  +  A*2) 


q2  .2  ,  ,.2_2 

p  =  A  +  O  . 


Proof  of  Corollary  5.1.1. 

Using  r4.8)  in  (5.5),  one  obtains 

bn+l  £  An  , 


(5.7) 


where 


■1  yv  i  ii  k  i .  i  ,i .»  ■ui!ii..«n<iiiiBnm*«nwvmr 


i  ■  ■■■nip 
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a2  =  1-  2V-\  +  p.2  (o2  +  A*2)  . 


Using  the  result  proven  in  Appendix  A ,  which  states 


n  *  (lim  auP  bnY 
\  n-*«  J 


lim  sup  b 
n-*« 


and  the  conclusion  of  Theorem  5.1  that  lim  sup  bn  is  finite, 

n-*» 

one  can  conclude  that  for  sufficiently  large  n 


n+1 


^  [a2  (b* ) 2  +  H2a2]  **  +  A  +  e  , 


(5.8) 


where 


and 


b*  =  lim  sup  b 

n-*oo 


e  >  0 


arbitrary 


From  (5.8)  it  follows  directly  that 

b*  ^  [a2  (b*)2  +  M-2o2]  ^  +  A 
Solving  for  b*  in  (5.9),  one  obtains 


(5.9) 


b*  <; 


'-^)2  ♦ 

,1-cT/  1-cT 


V-  A2 


1-a4 


(5.6) 


This  completes  the  proof  of  Corollary  5.1.1. 

Remark:  It  has  been  argued  by  Widrow  [49]  that  the  "rate  of 
adaptation  is  optimized  when  the  loss  of  performance  resulting 
from  adapting  too  rapidly  equals  twice  the  loss  in  per¬ 
formance  resulting  from  adapting  too  slowly."  Since  the 


- . — - — - —  —  — 
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rate  of  adaptation  is  inversely  proportional  to  the 
gain- constant  4  [42]#  an  equivalent  statement  to  the  one 
above  is,  "the  gain- constant  4  is  optimized  when  the  loss 
of  performance  resulting  from  adapting  too  rapidly  equals 
twice  the  loss  in  performance  resulting  from  adapting  too 
slowly."  Under  certain  conditions,  to  be  specified#  this 
rule  applies  very  closely  to  the  bound  (5.6),  as  shown  by 
the  following  argument: 

Assume  that  b*  (4)  ,  given  by  (5.6),  is  minimized  with 
respect  to  4  for 


4  « 


X*2  +  °2 


For  a  value  of  4  satisfying  this  condition  one  can  consider 
the  bound  -  — . 


b  W  ~2 


(5.10) 


The  component  due  to  changes  in  the  optimum  weight  vector  is 


(Set  0.^  =  0.) 
estimate  is 


! 


The  component  due  to  noise  in  the  gradient 


bMN 


(4) 


4. 


(Set  A  =  0 .  ) 
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Express 


b*  (4) 


as 


b*  (4) 


‘2bTV (M,) 


+  £bMN^*j 


Taking  the  derivative  of  b*  (4)  with  respect  to  4 
setting  equal  to  zero,  one  obtains 


(5.11) 


and 


2 


[bTV  ^0*]  “  [bMN 


(5.12) 


where  4q  is  the  value  of  the  gain  constant  4  which 
minimizes  b*  (4)  given  by  (5.11).  Solving  (5.12)  for 
4q  yields 


(5.13) 


Hence,  if 


2/3 

L  « 


>*2  ^  rr2 

A  +  o„ 


then  the  value  of  4  given  by  (5.12)  and  (5.13)  is  close 
to  the  value  of  4  which  minimizes  the  bound  (5.6).  In 
other  words,  for  a  slowly  varying  environment,  a  good  rule 
of  thumb  is  pick  the  gain  constant  4  using  Widrow' s  rule. 
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B .  NONSTATIONARITY  OF  THE  BOUNDED -VARIATION  CLASS 

It  may  readily  be  seen  that  Corollary  5.1.1  applies  to 

the  bounded  variation  problem  since  2  <  »  implies 

k  K 

-*  0  .  However,  within  this  class  of  nonstationarity  is 

it  it 

the  stationary  weight  vector  case,  «  W  for  all  k  ,  and 

it  +  + 

the  asymptotically  stationary  weight  vector  case,  -♦W  . 

It  will  be  shown  that  stochastic  approximation  algorithms 
^n*"*0^  can  a^so  b®  applied  with  success,  to  these  two  cases. 
The  result  is  summarized  by: 

Theorem  5.2.  For  the  adaptive  algorithm  given  in  Chapter  III 
with  Gn(W)=W,  if  the  sequence  [M-n)  satisfy 


i) 

4  > 

0 

n  — 

ii) 

n  n 

=  00 

iii) 

2 

<  03 

n  n 

tThe  fact  that  l|lw*  .  -  W*||  <  «  implies  limW*=W*  follows 


k-*°° 


from  the  inequality  (m>n)  , 

;!Mn-  "m11  =  '51<wk4.1-wk>^X  K+l-"> 


X  K+i  -  "k11  - 0  as  n-"  • 

V  n 

Hence,  by  the  Cauchy  convergence  criterion  [58  ]»  W  -  lin» 

k-»<» 


exists . 
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bn+l^  |‘VI'‘n"n  +  I  *  +  T  I  ‘V'i  +  ^  +  e  ^  . 
Iterating  backwards,  one  gets  the  result 


n+1 

where 


4a  M4*--  f  r.^  fi+^V:2i^  • 

Lk=N  \  /  J  **  /  3  \  * 


3; 


+  Ak  +  s  Ak  • 


As  done  in  the  proof  of  Theorem  4.2,  one  can  use  Kronecker’s 
lemma  (Lemma  4.2.1),  to  conclude  that  if 


and  for  some  N 


I  3 1< 


k*l 


7T 

n>N 


(l  +  7^0^  0 

I  sin 


then  lim  b  =  0  .  If 

n-oo  n 


and 


Z 

k=l 

00  ~ 

E 

Kil 


<  00 


E  A  <  °°  , 

k-1  k 


then  one  has  F  3r  <  »  .  Using  the  inequality  e”x  2  1-  x 
k=l  K 

it  follows  for  sufficient  large  N  , 


'7  r 


ii  i|  i  h  i  i  |  naVPi  1 1  i  l.  i  ■  -  • 


T 


T* 


n 

7 r 

(1+^) 1 

exp 

e:<p 

£ 

exp 

exp 

2 

Thus,  lim  b‘  *  0  . 

n-*oo 

The  probability  one  result  follows  in  a  manner  similar 
to  Theorem  4.2. 

This  completes  the  proof  of  Theorem  5.2. 
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C.  NONSTATIONARITY  OF  THE  BOUNDED  OPTIMUM  CLASS 

It  is  often  the  case  that  the  optimum  weight  vector  is 
known  to  lie  within  a  p-dimensional  hypersphere.  For  example, 
if  the  background  noise  field  fluctuates  about  some  average 
value,  the  optimum  will  fluctuate  about  a  fixed  vector.  The 
following  theorem  gives  an  upper  bound  for  the  sequence 
(bn)  for  this  model. 

Theorem  5.3.  For  the  adaptive  system  described  in  Chapter  II, 
if  there  exists  a  vector  Wg  and  a  positive  constant  B 
such  that 


lim  sup  |iW*  -  W^|l  <£  B  , 
n-*°o 


then 


lim  sup  bn  <  2B  +  min 
n-*°°  e>e 


s  +  4o' 


”C° \J  d--^  )  [2\-  4(A*2+o^)] 


(5.14) 


where 


2 

A  B 


*°  *2  , 
2**-4(*  +CJ*) 


(5.15) 
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* 


« 


4 


« 


Proof  of  Theorem  5.3. 

For  the  model  defined  by  the  hypothesis  it  is  not 
possible  in  general  to  bound  directly  the  sequence  (bn)  . 
However,  by  considering  the  sequence  {c*}  defined  by 

cn  -  EtllW,,-  »o  I'2  5 
2 

one  can  infer  an  upper  bound  for  fb*}  .  The  first  step  is 
to  obtain  a  recursive  relation  for  sequence  {c*}  . 

Subtracting  W*  from  both  sides  of  (3.5),  squaring,  and 
taking  expectations,  yields 

cn+l  ’  cn  *  ^Er<Wn-W*)\]  ♦  A[||*nl|2]  . 

The  second  term  on  the  R.H.S.  may  be  bounded  as  follows. 


*  \  T_ 


Et(wn-w0>  Zn] 


-  Et<wn-w;)TE[zniwn.w;]] 

-  E[(Wn-w;)TJn(Wn)] 


-  Et(W„-W*)TJn(Wn)]  + 


>-  -5<e+l  X‘ 


where  by  using  the  inequality  (Appendix  D), 


E[  |x  |]  &  I  E[x2]  ,  s  >  0  arbitrary. 


one  has 


*  *,t 

.t  r.v  \  * 


E[  (w;-  Wg)xJn(Wn)]  ^  E  [  ||W*  -  W*|l|!Jn(Wn)||]  1  -±(s  +^A"  aVl 


I/_  .  !->*%.  2^2 


Proceeding  in  an  analogous  manner  to  that  in  Theorem  4.1, 
one  obtains 


Therefore, 


E[l'Yn!!*]  £  +  (°2  ♦  x*4  ) 
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cn+l  ^  cl  -  rl  (E,bn  -*•  r2(e) 


(5.16) 


where 


r^fe)  =  2^ 


f .  -  4^)  -  ^*2  +  a*; 


r2  (e)  *  +  U-2o2 


Note  that  p^te)  >  0  provided  that 


e  > 


,*2  2 

^  B1 


and 


2\-  H(X*2  +  o^) 


e0(H) 


0  <  M.  < 


2\ 


x*2  2 

*  +  o* 


(5.15) 


To  bound  (5.16)  in  terns  of  c*  ,  use  the  Minkowski  inequality 


to  conclude 


From  this. 


cn-B£  bn^  cn+B  . 


'n+1 


^  [1-  ^(e)]^  +  2Br1(e)cn  -  rx  (e  )B2  +  r2(6)  •  (5.17) 


From  (5.17)  conclude 


lim  sup  cn  ^  B  + 
n-»°° 


/HEl 

yj  T1(eT  ‘ 


Hence,  again  by  the  Minkowski  inequality,  one  obtains 


lim  sup  b 
n-»°o 


/r,  (e) 

n  —  +  /i^7 
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Since  s^Sq  was  arbitrary,  one 

lim  sup  b  <  2B  + 
n  — 

n-*°° 

This  completes  the  proof  of 


can  conclude 


min 


IhEl 

V  ^(e) 


eis0 

Theorem  5.3. 


(5.14) 


The  following  corollary  gives  a  looser,  but  more 
tractable  bound. 


Corollary  5.3.1.  Under  the  hypothesis  of  Theorem  5.3 


lim  lim  sup  b  £  2B(1  +  — r— )  (5.19) 

4 ->0  n-*»  n  '  z  *' 


Proof  of  Corollary  5.3.1. 

Let  s  =  2s0  in  the  original  bound  for  lim  sup  bR 

n-»°° 

found  in  the  proof  of  Theorem  5.3.  After  a  little  algebra, 
the  desired  result  follows. 

This  completes  the  proof  of  Corollary  5.3.1. 

The  importance  of  the  results  given  in  this  section  is  that 

they  guarantee,  in  the  mean- norm- squared  sense,  that  if  the 

optimum  weight  vector  sequence  (W*)  remains  bounded  about  a 
* 

vector  WQ  ,  then  the  algorithm  (3.1)  will  yield  estimates 
within  a  finite  region  about  the  true  minima. 


VI.  AN  EXAMPLE 


The  purpose  of  the  example  discussed  in  this  chapter  is 
to  compare  the  theoretical  bound  (5.6)  with  an  experimental 
result  obtained  by  using  the  IMS  adaptation  algorithm  [42], 


w*+i  *  wk  - 


(6.1) 


where 


«k  -  \  -  wk*k 


As  shown  in  Appendix  E,  the  IMS  adaptation  algorithm  is  a 
special  case  of  algorithm  (3.5)  with  Gn(W)  *  W  and  parameters 


A.  =  inf  A  .  (R  )  , 
*  n  min'  n'  * 


**  *  T  WV  7 

7^i'(vy-wi2i 


(6.2) 


(6.3) 


(6.4) 


°2  =  T^Vn  -Rn"2^ 


(6.5) 


where  A  .  (R  )  and  A  (R  )  are  the  minimum  and  maximum 
min  n  max  n 

eigenvalues  of  Rn  *  respectively. 

The  criterion  used  to  measure  the  performance  of  the 

2 

adaptive  filter  is  the  excess  mean- squared  error  CR  as 
defined  by  (3.14).  Using  (3.15)  and  (5.6),  on  has  the  bound 


2  ^ 

n  l-a2  y/\l-a2/  1-a2 


lim  sup  c 


where  the  parameters  are  as  defined  in  Chapter  V. 


(6.6) 


-60- 
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The  object  of  the  adaptive  filler  in  this  example 

was  to  predict  a  random  process  fx^}  one  time- delay  ahead 
using  only  the  previous  value,  i.e., 

\  =  *k+l 

and 

*k  =  wk*k  ‘ 

The  data  was  generated  according  to  the  one- point  autoregressive 
scheme 

*k  -  V*-i  +  vk 

where 

Xj^  =  input  data  value  at  time  k 

»k  “  °-2  sin  ioo  +  c 

=  white,  stationary,  Gaussian  random  process 
with  zero  mean  and  unit  variance. 

2 

The  instantaneous  error  squared,  (x^^-w^x^)  '  was 

averaged  over  700  points  for  each  value  of  p.  used  in  the 
adaptation  algorithm  (6.1).  Two  experiments  were  conducted, 
one  with  c  =  0.0  and  the  other  with  c  =  0.5.  The  resulting 
averages  are  shown  as  a  function  of  n  by  the  solid  line 
in  Figs.  6.1  and  6.2.  The  corresponding  theoretical  bounds 
are  given  by  the  dashed  line  in  the  figures. 

The  discrepancy  between  the  two  curves  can  be  accounted 
for  by  the  following  three  observations.  First,  as  pointed 
out  in  the  previous  chapter,  the  theoretical  bound  is  a 
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worst-case  analysis.  In  other  words,  the  periodic  nature 
of  the  time- variability  of  the  data  is  not  exploited  by  the 
bound.  Second,  the  experimental  curve  corresponds  to  the 
average  mean- square  error  over  a  period  while  the  bound 
corresponds  to  the  maximum  mean- square  error  during  the 
period.  Last,  the  sequence  of  data  pairs  {(d^.x^)}  is  not 
an  independent  sequence. 

The  reason  for  the  larger  discrepancy  in  Fig.  6.2  than 
in  Fig.  6.1  is  due,  in  part,  to  the  increase  in  the  ratio 
and  its  effect  on  the  bound  (6.5).  For  the  case 
c  =  0.0  the  ratio  is  given  by 


1.04 


while  for  c*0.5,  the  ratio  becomes 


VII.  CONCLUSION 


A.  SUMMARY  OF  RESULTS 

The  research  reported  herein  dealt  with  the  convergence 
properties  of  stochastic  gradient  descent  algorithms  as 
applied  to  the  sequential  adaptation  of  array  processors. 

In  Chapter  IV,  sufficient  conditions  were  derived  for  the 
convergence  of  these  algorithms.  A  new  convergence  theorem 
and  proof  for  certain  stochastic  approximation  algorithms 
(u  -*  0  )  were  presented .  These  results  serve  as  a  guide 
for  deciding  whether  to  use  a  constant  n  or  a  decreasing 
M-n  when  the  dynamics  of  the  nonstationarity  are  known. 
However,  in  general,  one  has  incomplete  a  priori  information 
concerning  the  type  of  nonstationary  environment  to  be 
encountered.  For  this  reason  the  worst-case  analysis  pre¬ 
sented  in  Chapter  V  is  particularly  informative  as' to  the 
type  of  behavior  to  expect  of  the  algorithm  in  general. 

Representative  curves  for  the  bounds  derived  in  Chapter  V 
for  the  three  types  of  nonstationarities  considered  there  are 
shown  in  Fig.  7.1.  It  should  be  emphasized  that  these 
bounds  are  of  a  worst- case  nature.  They  are  summarized  by 
the  three  theorems: 
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Theorem  5.1  (Bounded  Increment)  If 


0  <  4  < 


2\  *  * 

and  llw _ -  WnH  £  A 


\*2  2 
X  +  °2 


then 


n+1 


lim  supE[||M  .  M*i|2,  < 
n-**  n  n1  1 


A  +  M-o, 


1-  yi-2^  +  42(A*2+  o2) 


(5.4) 


% 


4 


Corollary  5.1.1.  (Bounded  Variation)  If 


0  <  4  < 


and 


£  \  <  «•  , 
x  * 


then 


lim  sup 
n  -*00 


Et!!wn-w*l|2]  ^ 


40^ 


2\  -  4(* 


+  o2) 


(5.6) 


t 


Representative  curves  showing  bounds  on  system  performance 
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B.  OTHER  APPLICATIONS  OF  THE  ALGORITHM 

The  algorithm  developed  in  this  research  could  equally 
well  be  applied  to  a  number  of  problems  in  the  system  sciences 
such  as  system  identification,  process  control,  and  pattern 
recognition  when  the  underlying  statistics  are  alllowed  to 
vary  in  time.  A  large  nuxnber  of  authors  [25]-  [ 30 ]  have 
considered  the  application  of  stochastic  approximation  theory 
to  these  problems  in  the  stationary  case,  but  rather  limited 
consideration  has  been  given  to  the  time- varying  problem 
[23]-  [24]. 
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C.  RECOMMENDATIONS  FOR  FURTHER  WORK 

The  following  problem  areas  have  been  suggested  by  the 
research  reported  in  this  paper: 

i)  Although  a  mechanism  has  been  demonstrated  for  the 
behavior  of  the  adaptive  processor  in  non- stationary 
environments,  an  optimum  choice  for  the  gain 
constant  M-  is  impossible  unless  reasonably  complete 
a  priori  knowledge  of  the  nature  of  the  time- 
variability  is  available.  A  procedure  for  auto¬ 
matically  adjusting  M-  is  desirable.  An  original 
algorithm,  based  on  the  method  of  steepest  descent, 
for  adapting  H  is  found  in  Appendix  F.  Even  though 
the  procedure  has  been  shown  to  work  well  experi¬ 
mentally,  a  theoretical  proof  of  convergence  would 
be  desirable.  (The  corresponding  deterministic 
algorithm  is  discussed  in  Appendix  G.) 

ii)  The  analysis  presented  in  this  research  did  not 
exploit  the  linearity  property  of  the  gradient 
Jn  ^  incorporating  this  additional  property 

into  the  analysis,  one  might  be  able  to  obtain 
tighter  bounds  on  the  system  performance.  As  an 
indication  of  how  this  might  be  done,  see  Appendix  H 
for  a  discussion  of  the  scalar  problem. 
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iii) 


Much  attention  has  been  given  to  comparing 
stochastic  approximation  algorithms  and  the 
Kalman^ Bucy  filter  [55]-  [57].  The  analysis 
presented  here  enlarges  upon  the  problems  for 
which  the  two  methods  can  be  compared .  A  further 
elaboration  on  this  topic  could  prove  fruitful, 
since  the  algorithm  (3.5)  is  computationally 
simpler  to  implement  than  the  corresponding 
Kalman-Bucy  filter.  For  a  further  discussion, 
see  Appendix  I. 
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APPENDIX  A 

A  LEMMA  ON  THE  LIMIT  SUPERIOR 
FOR  A  SEQUENCE  OF  NON- NEGATIVE  REAL  NUMBERS 


The  following  lemma  establishes  an  important  algebraic 
property  of  the  limit  superior  of  a  sequence  of  non- negative 
real  numbers. 


Lemma.  Let  fxn)  be  a  non- negative  sequence  of  real 
numbers.  Define 


(*2)* 


lim  sup  x 
_  n 

n  -»  m 

lim  sup  x? 
_  n 

n  — >  «• 


then 


(X2)*  -  (X*)2 


Proof  of  the  Lemma. 

2  *  *2 

Assume  the  contrary.  Let  (x  )  <  (x  )  .  Let  a>0 

2  *  2  ★  2 

be  such  that  (x  )  <  a  <  (x  )  .  Now  there  exists  an 

2  2  * 

N^  such  that  for  all  n  2  it  follows  xR  ^  (x  )  +  e1  , 

2  2  * 

where  we  let  e1  =  a  -  (x  )  .  This  implies  xn  £  a  . 

* 

But  for  all  €«  >  0  ,  it  is  the  case  that  x„  >  x  -  e„ 

2  n  2 

* 

infinitely  often.  Let  €2  =  x  -  a  .  Then  xR  >  a  infinitely 
often.  Hence,  a  contradiction.  Now  suppose  that 
(x*)2  <  (x2)*  .  Let  a>0  be  such  that  (x*)2  <  a2  <  (x2)*  . 
Since  there  exists  an  N2  such  that  for  all  n  ^  Nj  we 
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have  xn^x  +  Cj  where  £3  =  a-x*.  Thus,  xn  <  s  . 

3ut  for  all  c4  >  o  „e  have  x2  >  (x2)*  .  ^  infinitely 
often.  Letting  e,  .  (x2,*  .  s2  We  have  x2  >  a2  .  A9ain. 
a  contradiction.  Thus,  it  nrnst  be  the  case  that 

(x2)*  =  (X*)2  . 
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APPENDIX  B 

PROOF  OF  KRONECKER'S  LEMMA 


This  appendix  presents  the  proof  of  Kronecker's  lemma 
stated  in  Chapter  IV.  The  lemma  is: 


Lemma  4-2.L  (Kronecker)  Let  {X^}  be  a  sequence  of  real 

numbers.  Let  {a^}  be  a  sequence  of  positive  numbers 
converging  monotonies lly  upward  to  infinity.  If 

=  sr  converges  to  some  finite  number,  say  s  , 


n 


r  A 


then 


lim 

n-»« 


i  n 

—  T 

n  k=l 


0  . 


Proof  of  Lemma  4-2,1- 

Before  proving  this  lemma  we  need  Abel 1 s  lemma  on 
partial  summation: 

Lemma  (Abel).  Let  (yn)  and  (zn)  be  sequences.  Define 
n 

s_  =  7  y.  .  If  m  >  n  ,  then 
n  15 

m- 1 

jzj  '  <Vm- Vn-1>  +  E  WVl’  • 

3  *n 


Proof  of  Abel's  Lemma. 

Noting  that  =  8j  "  sj-l  ' 


we  may  write 
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III  III 

yjzj =  j^1zj  (3j  - 3 j- 1 > 
m  m 

=  Z  zjsj  ’  Z  zjsj-l 

3=n  J  J  3*n  J  J 


z  s 
m  m 


m-i 

+  Y  z.s.  -  )  z.  ,s.  -  z  s 

j-n  3  3  £n  j+1  3  n  n-1 


<Vm-  Vn-11  +  £nV*j  ‘  zj+l) 


This  completes  the  proof  of  Abel1 s 


xi 

Defining  y..  =  ,  zj  “  aj  '  so  =  0  '  ao“0'  we  have  bY 


Abel's  lemma 


n  *fcl 

£*i  *  ansn  +  40sj  (aj  -  aj+l>  • 


Using  the  identity 


)  =  1 


we  may  write 


t  I>n- 

n  j=l  J  n  ]=0 


sj)(aj+x-aj>  • 


1 

To  show  —  Y,  x<  converges  to  zero,  we  use  the  repeated 
n  j=l  ^ 

application  of  the  triangle  inequality  to  obtain 
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APPENDIX  C 

A  MARTINGALE  CONVERGENCE  THEOREM 

The  purpose  of  this  appendix  is  to  prove  the  following 
lemma,  needed  in  the  proof  of  Theorem  4.2. 

Leans  4.2.2.  Let  (Cl,  5,  P)  be  a  probability  space.  Let 

fxn,  n)  be  a  stochastic  sequence  on  (Cl,  P),  with 

5n  —  yn+l  *  Let  be  a  sequence  of  non- negative 

real  numbers.  Assume  the  following  conditions  hold 

(HI)  sup  E[  |X.  |]  <  oo 
k  * 

00 

(H2 )  E  a  <  oo 
1  K 

(H3 )  E[Xk+1l for  all  k  a.e. 

Then, 

limX  =Xm  where  E[  |X  I]  <  00  a.e. 

ft  00  1  00  1  * 

n-»°° 


Remark :  If  a^  *  0  for  all  k  ,  then  by  the  basic  martingale 
convergence  theorem  [61]  #  the  above  conclusion  follows. 

The  effect  of  the  a^  is  to  translate  the  X^  . 

00 

Since  £  a.  <  °°  ,  one  should  expect  the  above  lemma. 

1  * 

We  now  formalize  this  observation. 


Proof  of  the  Lemma. 
Define 

submartingale  since 


Note  that  is  a 
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% 


' 


m 


E(zk+iiV 


E  a 


i=k+l 


4 


Et3W,k1 


E 


a 


i=k+l 


i 


t  follows  immediately  that 


E  ^  =  Zj,  • 

i=k  ^  K 


Efl^l]  EfJXjJ] 


OO  00 

+  <  sup  Ef  1x^1]  +  ^  a^ 


and  by  the  martingale  convergence  theorem  [61]  , 

lim  Z.  =  Z  and  E[|Z  |1  <  °°  a.e. 
k-»oo  ^  00  00 


Since  conve^9es  by  the  monotone  convergence  theorem, 

it  must  be  the  case  that 


lim 

k~*°° 


*k  = 


=  X 


a.e. 


Moreover,  since  X  =  Z  +  J'  a.  ,  E[  |x  |  ]  < 

oo  oo  ^  K  ® 


This  completes  the  proof  of  the  lemma. 


.Ji 
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APPENDIX  D 

AN  INEQUALITY  BETWEEN  THE  ABSOLUTE  MOMENTS  AFOOT  ZERO 
OF  ORDER  1  AND  ORDER  2 


An  inequality  useful  in  proving  convergence  theorems 


Lemma .  Let  x  be  a  random  variable  on  some  probability 


space  (ft,?',  P) .  Then,  for  all  e  >  0 
E(|x|]  + 


Proof  of  the  Lemma. 

The  trick  is  to  note  that  for  all  a  >  0  and  e  >  0 


^(-f  )  >  i 

Letting  a  =  (E[jx|  ])^  ,  we  can  obtain 


h(e  +  "7  E  [  |x  |2]  >  (Eflxl2]}15 


Applying  the  Cauchy^ Schwarz  inequality  (p=2  in  Holder's 
inequality)  we  obtain  the  desired  result. 


■§  +  E-%i21  >  e[|x|)  . 


This  completes  the  proof  of  the  lemma . 
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APPENDIX  E 


SOME  ADAPTATION  ALGORITHMS 


Examples  of  some  specific  adaptation  algorithms  which 
are  members  of  the  class  of  iterative  procedures  defined  by 
(3.5)  will  be  given  in  this  appendix. 


A.  A  STOCHASTIC  APPROXIMATION  ALGORITHM 

If  the  statistics  o£  the  filtering  problem  are  wide- sense 

stationary  (i.e.,  Rn  =  R  anc*  Pn  “  P  ^or  a^  n  ^  t*ie 

stochastic  approximation  algorithm  suggested  by  Gardner  [21 ] 

* 

for  estimating  W  based  on  the  data  set 

f(dk'V  *k-l,2,...n)  =  ((dk,Xk)}J  , 


is  given  by 


w  i  =  W  +  e  x 
n+i  n  n  n  n  ' 


(E.l) 


where 


Wn  is  the  estimate  of  W*  based  on  [(dv,Xv))n_1 

Jv  1 

T 

en  =  -  WnXn  =  the  error  between  the  desired  filter 

output  and  actual  filter  output  at  time  n 
^n  *  9ain  or  weighting  constant  at  time  n  . 


For  the  stationary  problem,  the  model  for  the  optimum 
weight  vector  given  by  (3.3)  becomes 


wn+l  -  pn<V  • 


where  for  all  W 
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Fn(W)  =  W  ; 


i.e.,  for  all  n 


Thus,  the  algorithm  (E.l)  is  of  the  correct  form  to  apply 
Theorem  4.2  to  show  convergence  of  the  sequence  (wn)  to 

*  A 

W  .  All  that  one  has  to  do  is  verify  that  Y  =  -e  X 

n  n  n 

satisfies  (3.9)  and  (3.10). 

The  conditions  under  which  Y  satisfies  (3.9)  and 

n 

(3.10)  will  be  derived  in  Section  C.  Both  the  Gaussian  and 
non- Gaussian  cases  will  be  considered  there. 


CONSTANT  -  U  ALGORITHMS 


If  the  statistics  of  the  problem  are  not  stationary,  the 

if 

algorithm  given  by  Widrow  [42]  for  estimating  Wn+1  based 
on  the  data  set  ^  }  i 


+  lie  Xn 
n  n 


(E.2) 


where 

Wn  is  the  estimate  of  W*  based  on  (  (d^.X^)  }^" 1 
T 

en  =  dn  -  WnXn  =  the  error  between  the  desired  output 
and  actual  output  at  time  n 


M-  =  gain  or  weighting  constant 
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Although  this  algorithm  is  similar  to  (E.l)  ,  note  that  the 
gain  constant  is  no  longer  a  function  of  time.  This  enables 
the  I/1S  algorithm  (E.2>  to  track  a  rather  general  sequence 
of  weight  vectors  fw*}  .  (See  Chapter  V. ) 

In  many  applications,  the  desired  response  sequence 
fdn)  is  not  available.  This  is  the  case,  for  example,  with 
the  filtering  problem,  where  the  desired  response  is  the 
unknown  signal.  However,  if  the  correlation  matrix  Pn  is 
known,  Griffiths  [44]  has  suggested  the  modified  algorithm 

Wn+1  =Wn-  ^nXn"  V  <E’3) 

where 

T 

Yn  *  wnxn  =  output  of  adaptive  filter. 

With  this  algorithm,  one  does  not  need  the  desired  response 
to  be  able  to  adjust  the  filter . 

Widrow,  et  al.  [43  ]  have  proposed  an  alternate  procedure 
for  supplying  training  signals  while  simultaneously  processing 
the  received  signal.  Griffiths  [44  ]  showed  this  approach 
is  equivalent  to  his  algorithm, 

Wn+1  ■  Wn'  .  <*•♦> 

where  the  matrix  C  is  related  to  the  spatial  characteristics 
of  the  array  and  V  controls  the  temporal  characteristics. 


* 
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2 

By  allowing  p  to  become  large,  the  sequence  (wn) 
converges  in  the  mean- value  to  the  maximum-  likelihood  weight 
vector 

W*  =  R-1C  (CTR_1C)_1V  . 

Kelly  [7  ]  has  shown  that  when  the  data  pairs  (d^,*^) 
are  jointly  Gaussian,  the  maximum- likelihood  weight  vector 
is  also  the  weight  vector  which  minimizes  output  power 
E[y  ]  subject  to  the  linear  constraint  C  W  =  v  .  Rosen 
[  10]  has  developed  a  gradient  projection  method  for 
iteratively  computing  this  constrained  weight  vector.  Lacoss 
[40]  has  extended  the  technique  to  the  stochastic  design 
problem.  Frost  [48]  has  modified  this  procedure  for  imple¬ 
mentation  on  a  digital  computer.  Frost's  procedure  automa¬ 
tically  corrects  for  quantization  errors  introduced  in  the 
constraint  equation  during  adaptation. 

The  algorithm  suggested  by  Frost  is 

Wn+1  -  p<wn  -  WnV  +  Q  (E.5) 

where 

T  -IT 

p  =  i  -  c«rc)  Xc1 

Q  =  C(CTC)_1V, 

and  the  optimum  weight  vector  is  given  by 


83 


*  _1  T-l  -1 

W  =  R  ■LC(C  R  C)  V  . 

Mathematically,  algorithm  (E.5)  is  equivalent  to  the  algorithm 

Wn+1  =  P(Wn- ^ynXn  +  URW*)  +  Q  (E.  6) 

•fa 

since  PRW  =0  .  This  algorithm  satisfies  (3.5)  with 

Gn(W)  =  PW  +  Q  . 

It 

Note  also  that  W  satisfies 

*  * 

W  =  PW  +  Q  . 

Hence,  the  model  (3.3)  applies  with 

F  (W)  «  PW  +  Q  , 
n 

and 

unso  . 

Convergence  of  (E.6)  implies  convergence  of  (E.5). 

The  convergence  analysis  of  the  algorithms  given  in 
this  section  for  the  stationary  statistics  problem  follows 
from  Corollary  4.1.1  where  the  relevant  parameters  are  found 
in  a  fashion  similar  to  that  done  in  Section  c  for  the 
stochastic  approximation  algorithm  (E.l).  The  behavior  of 
(E.2-  E.6)  for  the  nonstationary  problem  may  be  obtained  by 
reference  to  the  results  in  Chapter  V.  It  should  be  noted 
that  these  results  are  far  more  general  than  any  previous 
convergence  analysis  for  (E.2-E.6)  [42]-  [48]. 
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C.  Sufficient  Conditions  for  Convergence  of  Stochastic 
Approximation  Algorithm 


1.  General  Non- Gaussian  Case 

To  verify  (3.ft),  note  that,  by  the  independent  samples 
assumption  for  [(d^.X^)}  ,  WR  and  (dn#Xn)  are  independent. 
Hence, 


E[Y„|W, 


n  '  n 


,W  ]  = 


E  [  X  X^W  |W„ 
L  n  n  n  n 


★ . 

.W  ]  - 


E[dnXJWn 


»W  ] 


E [ X  X*"  ] W  -  E[d  X  ] 
n  n  n  n  n 


*  RW 

n 


* 


RW 


=  J(W  ) 
v  n/  . 

To  verify  (3.19)  use  the  chain  of  inequalities, 

»  <xnxn-  ■»)(«„-«*)-  <XnXnW‘  '  dnXn>  »* 

<;  [  II  (XnXn  -  * )  ("n  ■ "  w* )  II  +  II  (XnXnK*  '  dnXn  > 11  ^ 2 
<  2 1|  (XnX^-  R)l|2||Wn-  W* || 2  +  2 II  (XnX^W*  -  dnXn)l|2  , 
to  show  that  by  the  independent  samples  assumption 

E[(|Yn-  J(Wn)||2|Wn,W*]  1  O2  +  o2||Wn-W*||2  , 

where 

°l  -  2Et™<XnXnW*-dnXn)l|2l 
C22  =  2E[||  (XnX^-  R)  II  ]  . 


and 
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Therefore,  by  Theorem  4.2,  under  the  independent  data  pairs 

assumption  and  existing  moments  of  order  four,  the  fw  1 

i  nJ 

given  by  (E.l)  converge  with  probability  one  and  in  mean- 
square  to  W  if  (M-n)  satisfy  the  usual  conditions. 

(i)  >  0 

(ii)  2  M-  ss  oo 

n 

(iii)  I  M-2  <  «  . 

2 .  Gaussian  Case 

If,  in  addition  to  the  assumptions  given  in  part  1, 
one  assumes  that  the  data  pair  (<3n*xn)  is  jointly  Gaussian, 
he  can  obtain  tighter  bounds  for  oj  and  o 2  .  This  is 
shown  by  the  following  argument. 

By  the  independent  samples  assumption  it  follows  that 

e[  II  (xnx£-  R)  (wn-  w*)||2  |wn,w*] 

=  (Wn-  W*)TEf(XnX^-  R)  2  ]  (Wn-  w*)  . 

Senne  [46]  has  shown  that  if  Xn  is  Gaussian,  then 

Et(xnxn-R)2]  =R2+Rtr^R5  # 

where 

tr{R)  =  X  r.  .  =  trace  of  R  . 
i=l  11 

Therefore, 

E  r  II  (XnxJ  -  R)  (Wn  -  W* )  II 2  |Wn,  W1  <  ^ax  ( Amax  +  tr  { R )  )  |lwn  -  W*  |! 2  . 


. . Mill  )  1  . 

I 

%  I  , 
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where  ^  is  the  maximvun  eigenvalue  of  R  . 
ma  x 

Define 


*  T  * 

Sn  *  dn  "  x«w  • 

n  n  n 


Since  the  data  pair  (dn*xn)  are  jointly  Gaussian,  it 

follows  that  any  linear  combination  is  Gaussian  [  ]  .  In 

* 

particular,  sn  is  Gaussian.  Moreover,  since  by  definition 
of  W*  , 


E[6*Xnl  -  0  -  E[e*]E[Xj  , 


nJ  1  n' 


it  follows  [  ]  that  sn  and  Xn  are  independent.  Hence, 

* 

en  and  any  function  of  XR  are  independent  [  ]  .  Therefore, 

Et  <Xn*y  -  dnXn)T(X„xI-  R>  |Wn'W*' 


-  Ets*]E[X^(X„X^-R)|  (W„-W*) 


n '  n  n 


n 


and 


=  0  , 


E(|,XnXnW*‘dnXn)l|2lWn'W*J  *  EC  <sn)2j  EC^xn,,2j 

=  (tr{R})£(W*)  . 


Hence , 


E[||Yn-  J  (Wn )  II 2  |Wn,W  ]  <  ^  (W  )[tr(R}j 


n 


n 


+  '*max  +  XmaxtrfR’>»Wn-W‘»2 
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APPENDIX  F 

TIME- VARYING  ADAPTATION  ALGORITHM 

One  of  the  major  problems  connected  with  using  the 
IMS  adaptation  algorithm  , 


^nXn 

n  n 


(3.5) 


discussed  in  Appendix  E  is  the  choice  of  M-  to  use  during 
adaptation.  Without  any  a  priori  knowledge  of  the  time- 
varying  characteristics  of  the  data,  one  would  like  an 
algorithm  which  automatically  seeks  out  the  optimum  value 
of  M-  without  resorting  to  a  random  trial- and- error  method. 
The  following  algorithm  was  originally  suggested  by  the 
author  to  accomplish  this  task' 


with  respect  to 
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_  e2  =  2e  . ^>+1 

oi-n  n+1  n+1  (jMn 


■  2Vl  <dn+l  -  Xn+1  wn+l> 


2en+l  Xn+1  1  wn+l  * 


2en+l  Xn+1^7  <Wn-  Wn  > 


-  -2enen+l  xn+lXn 


we  obtain  the  algorithm 


+  Ae_e_ . ,X_X_ 


n+1  n 


'n  n+1  n  n+1 


While  most  of  the  time  this  algorithm  did  perform 
quite  well  under  experimental  conditions,  there  were 
instances  for  which  it  did  diverge.  The  reason  for  this 
behavior  is  that  the  value  of  X  to  be  used 
depends  upon  the  initial  weight  vector  .  This  is 
demonstrated  by  the  following  example. 


Example. 

The  deterministic  adaptation  algorithm  equivalent  to 
(F.l)  and  (F.2)  is  given  by 


=  U,  +  *  Z  (n-l)Z(n) 
n  n- 1 


(F.3) 


Z  (n+1)  =  [I  -  ^nRj  Z  (n) 


(F  .4 ) 


where  Z(n)  is  related  to  W(n)  by 
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Z  (n)  =  R  (W  (n)  -  W*) 

R  =  covariance  matrix  of  the  input  data  (which  is 
assumed  to  be  stationary  for  this  example) 

W*  =  optimum  finite- dimensional  linear  estimator. 

We  shall  consider  the 

Z  (n)  =  zx(n) 
z2  (»>. 

Then,  our  algorithm  ' 

Un  c  ^n-1  +  A(1_kLn-lr)  (2l{n-1)  +z2^n“1^) 

*i(n+l)  *=  (1- M.nr)zi(n)  i«l,2 

Defining 

a  (n)  «=  1-  p.nr 
3  =  Xr 

we  can  write  the  above  algorithm  in  the  form 

a(n)  =  a(n-l)  -  (3a  (n-1)  (z2  (n-1)  +  z2  (n-1) 

z^  (n+1)  =a(n)zi(n)  i  =  l,2 

Letting  z^(l)  =z2(l)  ,  we  can  further  simplify  to 

a2  (n)  «  a2(n-l)  (1- 23z2(n-l))2 
z2  (n+1)  =  a2  (n)z2  (n) 

If  z2(l)  =  and  =  0  (a(l)  =  l)  where  €>0  ,  then 


case  with 


r 

Lo 


:] 


(r>0) 


becomes 
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zj(2) 

a2  (2)  =  (1  +  2e  ) 2 
zl(3)  =  ^  (1  +  2e)2 

a2  (3)  =  (l  +  2e)2fl-  2  (1  +  2e  )2  (1  +  e )]  2 
i  (1  +  2e  ) 3 


etc. 

2  2 
with  a  (n)  -*  oo  as  n  -*oo  .  Hence  z*(n)  _♦»  as  n  -»»  . 

Thus,  the  choice  of  A  (or  3  above)  for  stability  of  the 

algorithm  does  depend  on  initial  conditions. 

End  of  Example. 

Is  it  possible  to  modify  the  time- varying  algorithm  such  that 
initial  conditions  no  longer  govern  the  stability  of  the  processor? 
The  argument  to  be  presented  for  modifying  the  deterministic 
gradient  procedure  suggests  that  the  stability  of  the  corres¬ 
ponding  ms  time-varying  algorithm  is  independent  of  initial 
conditions.  No  theoretical  proof  is  available  as  yet  to  support 
this  conjecture however  the  experimental  results  at  the  conclusion 
of  this  argument  do  support  this  hypothesis. 

An  important  reason  for  why  the  stability  of  the  method  of 

steepest  descent  algorithm  doesn't  depend  on  initial  conditions 

is  linearity.  Although  the  modified  algorithm  can't  be  made 

linear,  the  norm  of  _Z  f n )  can  almost  be  made  linear  by  updating 
Un  according  to 

For  the  determinstic  gradient  procedure  when  the  ratio  A  /a^  is 
less  than  two,  convergence  can  be  proven.  See  Appendix  G  for  a  proof . 


ZT(n  -  1)Z  (n) 
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^n-1 


+  X 


Hz  (n-  1)||2 


The  algorithm  then  becomes 


ZT(n-  1)  [I  -  nn_1  R)  Z(n-l) 


4n  =  Ha-1  +  A 

|lz  (n-  1)  ||2 

(F.  5 ) 

Z  (n+1)  =  [  I-  n.n  R]  Z(n)  . 

(F.6) 

The  corresponding  stochastic  algorithm  is 

e  (n)  XT(n)X(n-l) 

(F.7) 

"n  *"  M’n"1  '  7'  e^n-1)  ||X(n-l)||2 

W(n+1)  *5  wfn)  -  nne(n)X(n) 

(F.8 ) 

where 

e  (n)  *=  difference  between  desired  output  and 

of  adaptive  filter 

X(n)  *s  data  vector  at  time  n 

actual  output 

The  stochastic  algorithm  can  be  further  modified  by  observing 

a  few  properties  of  the  deterministic  algorithm,  the  idea  being 

that  we  want  the  stochastic  algorithm  to  behave  in  as  deterministic 

way  as  possible  without  knowledge  of  the  a  priori  statistics  of 

the  problem.  Define 

R  Z(n-l) 

°n-1  llz(n-l)||2 


and  note 


* 
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where 


Since  X*  is  in  general  unknown,  the  choice  of  K  would  appear 
to  be  a  problem.  However,  experimental  evidence  suggests  that 
the  choice  of  K  is  relatively  unimportant  (the  results 
presented  here  used  K  =  »  ).  This  is  not  really  too  surprising, 
for  if  |x  becomes  too  large  we  should  expect  the  quantity 

e(n)  XT  (n)X  (n-1) 
e(n_1)  llx  (n-1)  || 2 

to  be  biased  negatively.  Thus  p.  would  tend  to  become  smaller. 

The  following  set  of  experiments  have  been  conducted  on  the 
IBM  1130  to  determine  the  behavior  of  the  stochastic  algorithm. 

The  data  was  generated  according  to  the  one- point  autoregressive 
scheme 

xt  c  at-i  Vi  +  vt 

where 

xt  =  data  value  at  time  t 
a^  =  b  sin  (wt)  +  c 

v^  ■  white,  stationary,  gauss ian  random  variable  with 
variance  one  and  zero  mean. 

The  purpose  of  the  adaptation  filter  was  to  predict  the  process 
(xt)  one- point  ahead  of  time.  Figs.  F.l  and  F.2  show  the  mean-squre- 
error  of  the  prediction  filter  as  a  function  of  the  initial  choice 
p,  for  M-  for  various  values  of  X  ,  b  ,  w  ,  and  c  .  The  averages 
were  computed  over  700  points  after  3800  adaptations.  It  should 
be  recalled  that  for  X  =  0  ,  one  has  the  basic  IMS  adaptation 
algorithm. 


It  is  easy  to  show,  just  by  repeating  the  argument 
given  for  the  U1S  adaptation  algorithm,  that  the 
algorithm  given  by 


w 


n+1 


=  W  -  \i  y 

n  n  n 


(3. 


can  be  modified  to  yield 


M-  =  P- ,  +  A 

n  n- 1 


yTY 

rnxn- 1 


n-  l1 


-1 


Other  schemes  for  varying  the  step- size  are  discussed  in 
the  references  [30  ]  -  £32  ]  . 


97 


APPENDIX  G 

A  DISCUSSION  OF  THE  DETERMINISTIC 
TIME-VARYING  ADAPTATION  ALGORITHM  OF  APPENDIX  F 


The  purpose  of  this  appendix  is  to  discuss  the  deterministic 
time- varying  adaptation  algorithm  developed  in  Appendix  F. 
First,  a  convergence  theorem: 


Theorem.  Let  the  sequence  of  vectors  (Zn)  be  generated 


by  the  pair  of  recursive  relations 

M  ,  +  X  z 

"  -1  l!Zn-lJI2 


Zzl. 


(F.5) 


Zn+1  =  ^"V^n  (F‘6 

where  R  is  a  positive-definite,  symmetric  matrix  with 

* 

eigenvalues  lying  in  the  interval  [0  <  A#  ,  A  <oo]  . 

If  A*/A#  <  2  ,  0  <  A  <  1/A*  ,  and  0^  ,  then 

lim  W  =  W*  . 
n 

n-»“ 


* 

Remark.  Since  R  is  positive  definite  and  znssR(wn"w  '  • 
the  convergence  of  Zn  to  0  is  equivalent  to 

lim  w  =  W* 

_ _  n. 


Proof  of  Theorem. 

The  first  step  is  to  derive  sufficient  conditions  on 
the  sequence  (M-n)  for  convergence  of  (F.6).  Note  that 
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i  III-  M.nR|ll|Znl| 


for  some  N  >  1  .  Now  ||l-UnR|l<l  if.  say,  for  some  6>0 


6  ^  U  <  -4-  6 

n  “  A  ' 

for  sufficiently  large  n  .  Hence. 


(G.l) 


imf  JT  ||i_  llr||  |  =  o 

-»»Lk*N  J 


lim 

n-» 


and  consequently. 


lim  Z  =  0  . 

n-»oo  n 


To  verify  that  the  generated  according  to  (F.5)  satisfy 

(G.l)  for  sufficiently  large  n  ,  proceed  as  follows. 

Note  that  (F.5)  may  be  written 


-  a- Vl>Vl+X 


where 


ztrz 

an  --n-§ 

n  flznlf2 


Therefore,  if  0<A<  -^r  and  [i _ ,  >0  ,  then 

-  n-1- 

0  <  +  (1-  AA#)Hn_i  +  A 


(G.2) 


Consequently,  if  U^>  0  ,  then  (G.2)  holds  for  all  n  . 
Upon  successive  iteration,  one  obtains 


0  < 


(1-  AA  ) 


1(4l"v*)+"^  un  £  (1"  A\)n’1(^i-^)  +-^ 

A  A  * 
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Hence,  if  0<A< — -  ,  then  for  all  s>0  there  exists  an 

A 

N  such  that  n >  N  implies 

O  5 


+  s  . 


(G.3) 


Let  r  =-t—  .  Then, 


Since  l<r<2  by  hypothesis,  for  any  0<e  <  there 


exists  a  6  >  0  ,  namely 


6  s 

A 


3uch  that  (G.l)  is  satisfied  for  sufficiently  large  n  .  Hence, 


lim  =  0  . 
_  n 
n-»<» 


This  completes  the  proof  of  the  theorem. 


An  interesting  question  arises  at  this  point.  By 
changing  M-  according  to  (F.5),  is  the  rate  of  convergence 
of  (F.6)  increased  over  that  of  using  a  constant  M-=A  ? 

A  partial  answer  is  given  by  the  following  argument.  Note 


that 


lz„+ill2  =  11(1-  XR)znll2+  (1-  *Vi,2^ii|rVi»2 


-2(1-  *■*>!!„  . 


Therefore,  if 


2(l-7V1)4n-1ZhR(I-  AR,Zn  >  d"  Aan-l)  ^n-l»RZn-lH2  '  (G‘4) 
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then  the  rate  of  convergence  of  the  sequence  (Zn)  has  been 

speeded  up  by  changing  the  step  size  M-  .  If  0  <  X  <-^  , 

X 

then  one  can  write  (G.4)  as 

2ZnRZn  —  t^N-l^n-l*2*)2#^  • 


This  inequality  holds  if  for  all  the  eigenvalues  of  R  , 

X.  ,  it  is  the  case 

2Xi  >  [(1-  Xan_1)^n_1  +  2X]X2 
or 

^  >  t(l-^on_1)Un.1+2X]  . 

From  the  proof  of  the  theorem,  if  ^  ,  then  M-  <  -5— 

A  A*  n  A* 

for  all  n  .  Consequently, 

(1-  *an-imn-l+  2A  ^  (1- XX#)^  +  2X 


1  +  XX, 


Hence,  if 


1  +  XX, 

X .  —  x^ 


±  > 


the  rate  of  convergence  has  been  increased.  Since 


1  X 

for  all  1  _<  i  p  ,  a  sufficient  condition  for  speed-up  is 


*  > 


l  +  u. 
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or 


Therefore  one  wants 


A* 

X  <  2 


9 


0<^t 


and 


0  <  A  <  min 


T*  “  1 


A* 


It  would  appear  that  these  conditions  are  far  too 
restrictive  for  the  speed-up  conditions.  In  Fig.  G.l  is 
plotted  the  convergence  curves  obtained  experimentally  for 

4r  it 

the  case  where  the  ratio  A  /A#  is  equal  to  8  (A  =  1 ) 
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APPENDIX  H 

AN  ALTERNATE  ANALYSIS  OF  ADAPTIVE  ESTIMATION 
IN  NONSTATIONARY  ENVIRONMENTS 


In  the  text  we  considered  a  state- space  representation 
for  the  dynamic  model  of  nonstationary  statistics.  Other 
models  could  also  be  developed.  In  this  appendix,  a 
first-order  two- state  Markov  model  will  be  considered  for 
the  single  weight  case.  The  results  obtained  will  be 
shown  to  apply  to  more  general  scalar  models. 

The  problem  to  be  discussed  is  defined  as  follows.  Let 
the  sequence  fwn)  be  defined  by 


w. 


n+1 


=  -  My 

n 


n 


where 


'n 


r(wn-wn)+zn 


EtzJwn'wn>  *  0 

E'zn  lwn-wn'  =  °i  +  02lfun-unl'2 


and 


=  least-mean- square- weight  at  time 


n  . 


An  example  of  this  type  of  algorithm  is  the  LMS  adaptation 
algorithm  discussed  in  Appendix  E  when  the  input  data  pair 
sequence  is  an  independent  Gaussian  random  process  with 


E[xnlVwn>  =  E^'  -  r 
E(<VwnV2)  =  °l /r 

Et<xn-r)2lwn'wnl  *  °2  • 


and 
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Let  the  optimum  weights  fw*)  be  generated  by  a  first- 

order  N- state  Markov  process  with  transition  probability 

matrix  P  =  (p^j)  •  If#  at  time  n  ,  the  Markov  process  is 

* 

m  state  1  ,  then  wft  *  9^  .  The  steady- state  probability 
vector  is  given  by 

7T  * 

where 

* 

7ri  *  steady-state  probability  that  wn  *  ^  • 

As  an  example,  consider  the  two- state  Markov  process 
shown  in  Fig.  H.l.  If  at  time  n  the  Markov  process  is  in 
state  1  (indicated  by  )  ,  then  w*  ■*  ^  ;  if  at  time  n  , 
the  Markov  process  is  in  state  2  (indicated  by  q?2  )  #  then 

it 

wn  -  ^2  *  The  transition  probability  matrix  is  given  by 


p  = 

'Pu 

P12 

1-p 

p 

_P21 

P22 

q 

f 

l-q 

The  steady-state  probability  vector  is  given  by  [60] 


TT  =  [»,.7T,] 


q 

pfq 


'  £H-q 


To  demonstrate  the  tracking  ability  of  the  algorithm, 
consider  first  the  random  process  (wn)  defined  by 


w. 


n+1 


wn-M(wn-w„)  , 


i.e.,  the  noiseless  gradient  descent  with  z  =  0  .  The 
2  *,,2 

behavior  of  b‘  =  E[li»;n-w  |l  1  18  summarized  by 
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Fig.  H.l.  Two- state  Markov  process. 
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Theorem  1.  Define  the  random  process  (wn)  by 


wn+l  *  «n- «(wn- «n) 


where  w1  is  arbitrary  subject  to  Ef^-w*)  J  <  oo  . 

,  *, 

lwnJ  is  a  stationary  random  process  with  finite  second 
moment.  If  0  <  M-r  <  2  ,  then 


,n-l  n-1 


lim  b^  *  lim  (Hr)2  T  £  (1-  Hr  )1+^E[  (w i+2-w*)  (w* .,-w!)  ] 
n-*°°  n-*°°  fiO  j=0  1+2  1  3+2  1 


Proof  of  the  Theorem  1  . 

Note  that  by  successive  iterations,  the  algorithm  can 
be  written 

wn+l  “  wn+l  *  (l-wr)n(w1-w;tl)+urX  (l-w)n-i(w*-Vk) 


Thus,  if  1 1  —  Ur  |  <  1  #  then 


bn+l  "  (wn+l_  wn+l)2l 


=  0(n)+  (M-r) 2  fi(l-ur)2n-1-3E[(w*.w;+1)(w;-w;+1 

where  lim  0(n)  =  0  .  By  the  stationarity  of  {w*}  one  has 

n-»oo  n 


★  * 


* .  * 


E[  ,wi  -  W  (wj  -  wn+l>  1  -  E'  <wn-i+2  -  wl>  (wn- j+2  *  wl> 1  ' 

2 

By  re- indexing  the  expression  for  bfi+^  ,  one  obtains  the 


result 


bn+l  *  °<n)  +  (ur)2 


n-1  r>- 1 

Z  t 

1=0  j=0 


(1-  tir)1+^c(i, j)  , 


where 
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c(i,j)  =  EC  (wi+2  "  wl^w*+2  -  w*)  ]  . 

The  conclusion  of  the  theorem  follows  immediately. 

This  completes  the  proof  of  Theorem  1. 

It  should  be  emphasized  that  this  theorem  holds  for  any 
stationary  random  process  (w^)  with  finite  second  moment. 
The  following  corollary  gives  one  a  way  of  evaluating  the 
expression  when  s 


Corollary  1.  If  the  (wn)  are  generated  by  an  N- state, 

first-order,  stationary  Markov  process  and  (wn)  are 
as  in  Theorem  1,  then 


limb2=(nr)2 

n-»<»  i=0  j  =0 


1=0  j=o  L 


>j+l_  pi+l 


+P 


li-j  l]  „ 


where 


-  {vS1} 


and  P.^  is  an  element  of  P^1  =  p  •  •  •  p . . 

m  factors 


Proof  of  Corollary  1, 

The  corollary  follows  immediately  from: 

Lemma.  Let  {@n)  be  a  first-order,  Estate,  stationary 


Markov  process  with  transition  matrix  P=  (P^j)  • 


Let  the  elements  of 


(P)m  be  denoted  by  P?^ 


Define 


(P)m  = 


Then 


?  =  9^  I  =  state  vector. 


Et<Wn>  =  ”T<P>,% 


Proof  of  the  Lemma . 

By  straightforward  calculation 


N  N 


Et®n+mSnl  ’  ^  WK+m  =  \  '  9n  *  V 1 


N  N 


■  &  &V*Vj£ 


=  9  (P)  9  . 


This  completes  the  proof  of  the  lemma. 


Noting  that  the  [  .  ]  operation  is  linear,  it  follows 


immediately  that 


lim  b2  =  lim  (ur)2 
n-»  oo  n->°° 


T  2  d-M-r ) 1+ j  9T  fl- P^+1-  P1+1+P  *1_  9 

i*0  j=0  L  J 


or  by  symmetry 
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(M-r)2  £  £  (1-  M.r)1+^qpTfl-  2P1+1  +  P  <p 

n-»oo  n-*°°  i=0  t=0  L  J 


This  completes  the  proof  of  Corollary  1. 


The  following  corollary  provides  a  nice  example  of  the 
tracking  ability  of  the  algorithm. 


Corollary  2.  Under  the  assumptions  of  Coro  Hair/  1  for  the 
two- state  Markov  process  considered  in  Fig.  1, 

.  .  .2  _  _  _  »2  _ 2  (p+q) _ 

n"  1  2(9l“  V  (2- ^r)[Hr  (1-p-q)  +  (ptq)] 


Proof  of 


Using  the  result  [  ]  ,  =  amr  v  -tt,-!  »  one  has 


-T1  T1 


I-  PJ 


-P^+1  +  pli_^  =  £l- ai+1- ai+1 F  ir2  -tt 


'rl  W1 


where  a  =  l-j>-q  .  Therefore,  by  Corollary  1, 

lim  b2  =  (M-rJ^.TT.  (q>1-'P9)2lim  J]  £  (l-M-r)i+^  (l-ai+1-a^+1+ali-^ 
n-*®  n-»a>  i=0  j  =0 

The  only  term  of  any  difficulty  in  evaluating  is 

iim  T  (1- Hr)1+:ial1"^  . 

n-»°°  i^0  j=0 

Defining  0  =  1-  M-r  ,  one  can  show 
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r1  nf«|i_jlei+j-  t  aljlaljl  V  e2nB~ |J 

Aoj=0  j  =- (rv- 1 )  1-32  j  =»- (r>- 1 )  1-32 


at 


where 


^1  ■  -4  A  ♦  a  w» 

j=-7rv-l)  1-3*  1-3' l-ap  J 


and 


t 


P2V  lj  1  „  H 


«2n  1  „  „  c.n-1  „n-l 

(  1+  23n(a0n(a(3)£ - — 


1-3' 


3-  a 


j  =- (n- 1 )  1-3 


ft2n 

C2n-  1} 


1-3' 


a 


3=  a 


In  the  limit  as  n-*°o  , 

n-1  n- 1 


ii.  E  [  =>u-Jlei+j  -■£{$  --S 

n-.«  i=0  fio  ±  a°  1-  B 


Thus,  it  follows  after  some  algebra  that 


limb2=7T7r  (55  -  ?  )2  ■  - . - 

limDn  Y2I?1  V  (2- Hr)  (nr(l-p-q)  +  (pfq) 
This  completes  the  proof  of  the  corollary. 


The  expected  squared  distance  between  the  optimum 
constant  weight  and  the  [wnj  is  given  by 

*,  * 

Corollary  3.  Let  fwn)  be  as  in  Corollary  2.  Let  wQ  be 

*  *  2  . 

that  constant  weight  which  minimizes  E[  (wn~  Wq)  J  . 
Then 


> 


A 


J,.J .mini.  AW II . pi II.MIUU.I1  . .  ■HI, HIM wupp mill, ■■ 


. . 


Ill 


, 

■j 


and 


«0  -  ,rl‘pl  +  r2v2 


r  *  *2-1 

E[  (Wn  -  WQ)  ]  =7T 


Proof  of  Corollary  3. 

It  is  a  well-known  result  [61]  that  the  constant  which 
2 

minimizes  E[(x-a)  ]  is 


Therefore, 


a  *  E[x]  . 


w0  >  +  T2<P2 


and 


Et  <w*- wo>2l  -  +  V2)2 


-  V2(<P1‘  V  ■ 


This  completes  the  proof  of  Corollary  3. 


Define 

P(M-)  =  lim  b2  -  E[  (w*-w*)2]  . 
n— »°° 


In  Pig.  H.2  is  plotted 


ptm _ 

e[(w;.w;)2] 


for  three  values  of 


p  +  q  .  It  should  be  noted  that  when  this  expression  is 
negative,  the  fwn)  has  smaller  misad justment  than  is 
possible  with  any  fixed  weight  vector. 


SMfcaftialiCHMaiailiife 


(0)  w 


Fig.H.2.  Plot  of  normalized  misad justment  as  function  of  Ur 


113 


Adding  noise  to  the  gradient  estimate,  as  originally 
discussed,  one  has 


Theorem  2.  Define  the  random  process  (wn)  by 


wn+l  *  wn  - 


where 


n 


r  (wn  -  wn)  +  zn 


where  z„  satisfies  the  three  conditions 
n 


E[zn|wn'wn1  *  0 

Etznlwn'w.‘l  -°1  +  °2(wn-  V 


2  .  J2  .  *2 


E[ziZj]  =0  i  ^  j 


Let  the  random  process  fwn)  be  stationary  with  finite 

second  moment.  If  0  <  U  <  0  ,  then 

r  +°2 

limb?  .  -  flxr)2  lin.  *£  *£  (1  -  ur)1^  (i.  j 

_ 11  ii  /«.*  .  \  _  jTTa 

2  ‘ 


n-»»  2r-H(r  +o‘)  n-»»  i=0  j=0 


HO 


2r  -  H(r2  +  o2) 


Proof  of  Theorem  2. 

Proceeding  as  in  the  proof  of  Theorem  1, 

n 

k 


wn+l “  wn+l  =  (l-W)n(w1-w*+1)+ur  (l-nr)n-1[r(w*-w*+1)-  zj 
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Squaring  and  taking  the  expectation  yields 

n  n 


b^1  =  °(n)+  (nr)2f  t  (1- nr)2r*-i-3E[  <«*-  w*+1)  <w*- w*+1 

lal  1*1  J 


)] 


0  n 

+  V  T  (1-  Hr) 
i=l 


2^i)[°21+°yi] . 


From  this  expression  a  lower  and  an  upper  bound  on  the  (b^J 
may  be  obtained.  After  a  little  algebra  it  can  be  shown 
that  both  bounds  are  equal  and  given  by 

b2  .  liB  ^ (1.  nr,i+3C(i,j) 

2r  -  M.  (r  +  )  n-*<»  i=0  j*0 


Ho 


2  2  ’ 
2r-  p.  (r*  +  Oj  ) 


where 


*  * 


C  (i«  j  )  *  E[  (wi+2-  wx)  (wj+2"  wi^ 


This  completes  the  proof  of  Theorem  2. 


For  the  two- state  Markov  process  previously  considered 
we  have  plotted 


_ P(H) 

r  *  *  i  , 

E[  (wn-  W0)  ] 


for  the  noisy  gradient  case  with  =  0  .  The  effect  of  the 

gradient  noise,  z  ,  is  to  increase  the  misad justment  and 

n 

decrease  the  magnitude  of  the  optimum  H  to  use  for  a  given 


(p,q)  pair.  This  is  strikingly  evident  by  comparing  the 
graphs  of  the  noiseless  gradient  descent  procedure  (Fig.  h. 
with  the  noisy  gradient  descent  procedure  (Fig.  H.3) 


b+d 


Plot  of  Normalized  Maladjustment  as  a  Function  of  Ur 


APPENDIX  I 


THE  ADAPTATION  ALGORITHM  AS  A  FILTER 

This  appendix  is  concerned  with  looking  at  the  time- 
varying  filtering  problem  from  an  entirely  different  viewpoint 
than  that  of  the  text.  It  will  be  shown  how  the  adaptation 
algorithm  (3.5)  can  be  used  as  a  filter.  The  performance  of 
this  system  will  be  compared  to  the  optimum  Kalman  system  for 
a  scalar  filtering  problem. 

A.  PROBLEM  STATEMENT  AND  ASSUMPTIONS 

It  will  be  assumed  that  the  target  signal  and  noise  field 
can  be  modeled  by  the  dynamic  systems 

0s(n+l)  -  Fs  0S  (n)  +  GSUS  (n)  (I. la) 

S  (n)  =  Hs  0S  (n)  +  Ys(n)  (I. lb) 

and 

®N  =  ^n  ^N  ^ ^  ®N  ^N ^  (1. 2a) 

N  (n)  =  ^  0N(n)  +  VN(n)  (1. 2b) 

where  ec  and  0..  are  the  state- vectors,  F„  and  F„  are 
S»  N  S  N 

known  matrices,  Ug  and  UN  are  random  vector  inputs  of 

T 

zero  mean  satisfying  E[Ug  (n)Ug  (m)]  =  , 

E[UN(n)UN(m)]  =  QN6nm  '  E  (UN  (n)US  (m)1  =  0,  Gg  and  are 
known  shaping  matrices,  Hg  and  Hj^  are  known  output  matrices, 

Vg  and  VN  are  random  noise  vectors  of  zero  mean  satisfying 

E[Vs(n)Vg(m)J  =  Es^nm  '  E( (m)]  -  Vnm  ' 

E (Vg  (n)VN  (m)]  =  0  ,  and  S(n)  and  N(n)  are  the  signal 
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and  noise  components  present  on  the  array  sensors  at  time  n  . 
The  received  vector  at  time  n  is  X  (n)  *  S  (n)  +  N  (n)  (see 
Fig.  1.1).  Note  the  change  in  notation  from  that  presented  in 
the  test.  It  will  aslo  be  assumed  that  vg  »  VN  *  ug  »  and  UN  are 
mutually  independent  Gaussian  random  vectors.  This  is  the  usual 
Kalman  model  for  dynamic  linear  discrete- time  random  processes. 

For  the  combined  system  model,  (I. la),  (I. lb),  (1. 2a),  and 


(I. 2b),  one  has 


where 


9  (n+1)  -  F0  (n)  +  GU(n) 
X  (n)  -  H0  (n)  +  V  (n) 


B  (n) 


«s (n) 


SN(n) 


(I. 3a) 
(1. 3b) 


U  (n)  -  Ug  (n) 


V"> 


V (n)  *  Vs(n>  +  VN(n) 
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This  system,  (I. 3a)  and  (I. 3b),  is  shown  in  Fig.  1.2. 

The  problem  is  to  determine  the  filter  {W(n+1,  j)  s  j*=l,2,  ..n) 
which  minimizes 


E[||S(n+l)  -  S(n+l|n)|r] 


where  S(n+lfn)  is  the  estimate  of  S(n+1)  given  by 


(1.4) 


S  (n+1  |n)  *  Y  W(n+1,  j)X(j)  . 

j-1 

The  S(n+l|n)  minimizing  (1.4)  is  called  the  minimum- variance 
estimator  of  S(n+1)  . 

B.  THE  OPTIMUM  RECURSIVE  ESTIMATION  FILTER 

As  shown  in  [57]  ,  the  minimunH variance  estimator  of 
S  (n+1)  ,  denoted  from  here  on  by  S  (n+1  |n)  ,  is  given  by 

S  (n+1  jn)  «  H9  (n+1  |n)  (1.5) 


where 


H  =  [Hg  0] 


(1.6) 


and  9  (n+1  |n)  is  the  minimunv- variance  estimator  of  f)(n+l) 
given  the  observations  X(1),X(2), . .  .,X(n)  . 

The  optimal  Kalman  recursive  linear  filter  for  estimating 
9  (n)  based  on  the  observations  X  (1)  ,X (2) , . .  .X (r>- 1 )  is  given 
by  (55]  -  [57] 

5  (n+1  |n)  -  F |p(n)HT^IP(n)HT  +  R)”^(X (n)  -  H §  (n|n-l)))  +3(n|n-l)j 

(1.7a) 


P 


P (n+1)  *  F |P (n)  -  P(n)HT(HP(n)HT  +  R)' 


1  HP(n)J 


FT  +  GQGT  (1.7b) 
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where  0(n|n-l)  is  the  minimum- variance  estimate  of  6{ n) 
given  the  data  observations  X(1),X(2), . .  .X(n-l)  .  P(n)  is 
the  error  covariance  matrix  at  time  n  defined  by 


(n)  =  E^e  (n)  -  £(n  |n-l)^  ^0  (n)  -  0  (n  |r>-l)^  j  . 


An  equivalent  form  for  the  Kalman  filter  given  by  (1.7a)  and  (1.7b)  is 


9  (n+1 1 


n)  =  fH 


0  (n  |n- 1)  -  K (n) 


H§  (n  [n-1)  -  X(n)^l 

(I.8a) 

i  m 

+  GQG 

( 1 . 8b ) 

R)"1  . 

(I.8c) 

(See  Fig.  1.3) . 


C.  A  RECURSIVE  FEEDBACK  FILTER  BASED  ON  THE  ADAPTATION 
ALGORITHM 

Consider  now  a  suboptimum  approach  for  estimating  S  (n) 
based  on  the  observations  X  (1)  ,X  (2 ) , . . .  ,X  (n-1)  .  The  idea 


is  to  first  estimate  0_(n)  by  0a  (n)  ,  say.  The  estimate 

3  S 


of  S  (n)  will  be  defined  by 


S  (n)  =  Hs0g  (n)  . 


(1.9) 


In  analogy  to  (2.3),  define  the  mearwquared  error  at  time  n 
by 

in  ^  Ef  !JS  (n)  -  S  (n)  ||2]  .  (1. 10) 


Using  (I. lb)  and  (1.9)  in  (I. 10),  one  has  the  expression 


£n  =  E[llHg[0s(n)-  9s(n)J  +  vg(n)||2] 


(I. 11) 


Vi 


Optimum  minimum- variance  array  processor 


An  unbiased  estimate  of  the  gradient  of  (I. 11)  with  respect 

A 

to  0  (n)  is  given  by 

9 

Ys(n)  =  H^Hs[0s(n)- 0g(n)]  -  H*Vg(n) 

o=  Hg  [  S  (n)  -  S(n)]  .  (1.12) 

The  corresponding  algorithm  for  estimating  0  (n+1)  ,  based 

S 

on  the  method  of  steepest  descent,  is 

9s(n+l)  «  Fg[0g(n)  -  HnYs(n)J  .  (1.13) 

This  algorithm  is  of  the  form  (3.5)  in  the  text. 

Note  that  the  algorithm  (1.13)  requires  knowledge  of 
either  0g(n)  or  S(n)  in  addition  to  the  signal  model 

parameters.  If  the  object  were  to  estimate  0  (n)  based  on 

s 

S  (1) , . . .S  (n-l)  ,  this  restriction  would  be  acceptable. 
However,  in  general,  one  does  not  know  the  signal  sequence 
(S(k))  . 

Suppose  we  estimate  the  total  state  vector  0  (n)  by 
0n  .  The  estimate  of  S(n)  will  be  defined  by 

S(n)  *  H0n  (1.14) 

where  H  is  given  by  (1.6).  Define  the  mean-squared  error 
at  time  n  by 

£n  £  E[||x (n)  -  X(n)||2] 

where 
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X(n)  =  H0n  . 

Proceeding  as  above,  the  algorithm  for  estimating  9  (n+1)  is 

§n+l  '  P[®n  -  W  «•«> 

where 

Yn  =  ^n'  V  -  »\ 

«  HTH0n  -  HTXn  .  (1.16) 

Comparing  (1.15)  with  the  Kalman  filter  (I. 8a),  one  should 
note  that  they  are  equivalent  if 

K  (n)  =  nn HT  . 

(See  Figs.  1.3  and  1.4.) 

The  performance  of  the  filter  given  by  (1.15)  and 
(1.16)  is  summarized  byj 

Theorem  1,  Let  (0n)  be  as  defined  by  (1.15)  and  (1.16) 
with  kin  =  M-  .  if 

nL'p(i- wfflT)i  < 1  f 

then 


lim  supE[||Sn- 0nl!2]  < 

n-*eo 


tr  f  FHTRHFT 1  +  tr  f GQGT ) 

1-  \L„[f(i-  ^HTH)] 


(1.17) 


t  2 

X  _(A)  is  defined  to  be  the  maximum  eigenvalue  of  the 
max  _ 

matrix  A  A  . 


adaptive  feedback  filter  based  on  the  IMS  algorithm. 
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Proof  of  the  Theorem. 

Subtracting  0n+1  from  both  sides  of  (1.15)  one  obtains 


»n+l*  «n+l  -  m-  UHTH1  <«n-  8n)  +liFHTVn-  GUn  .  (1.18) 


Defining 


Bn  -  E('0n- V<Sn-«„>Tl 


it  can  be  easily  shown  from  (1.17)  that 


Bn+1  =  F  [  I  -  HHTH]  Bn  [  I  -  M-HTH]  FT 

2  T  T  T 

+  U  FHaRHF  +  GQG  . 


Hence, 


trfBn+i}  =  tr(ATABn)  +  H2tr {FHTRHFT}  +  trfGQGT)  (1.19) 


where 


A  =  F  [  I  -  M-H  H] 


If  |lATA||  <  1  ,  or  A2  (A)  <  1  ,  then 

max 


tr^Bn+l^  Nnax(A)tr^Bn^  +  ^tr  fFHTRHFT)  +  tr  fGQGT)  . 


Hence,  it  follows 


lim  sup  trfB  )  <  'l2tr[FHTRHFT)  +  trf<saoT] 
_  .  n—  ,  -v  2  ... 


n-»oo 


x- wa> 


(1.17) 


This  completes  the  proof  of  the  theorem. 
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Remark:  A  tighter  bound  can  be  obtained  starting  from 


where 


Bn+1  *  “n‘T  *  * 


2  T  T  T 

S  «  H  FH  RHF  +  GQG  . 


It  can  be  shown  that  if  ^X(A)  <  1  ,  then  [Bn)  converges 
to  some  matrix,  say  B  .*  Hence, 

B  -  ABAT  +  S  . 


Taking  the  trace,  we  conclude 

tr  (b)  «  tr (ABAT)  +  tr(S} 
-  tr {ATAB}  +  tr[S} 


and  consequently 


tr[B}  < 


trJLsj 


\nin{I“A 


(1.20) 


Remark;  The  corresponding  result  for  the  Kalman  filter  is 

*  Ftp»-  P,HT(HP[oHT+  RJ^HP^jF1  +  GQGT  .  (1.21) 

In  general,  no  way  has  yet  been  bound  to  compare  these  two 
results.  However,  the  following  scalar  example  does  provide 
an  interesting  comparison. 

This  result  is  a  direct  consequence  of  fixed-point  theory  [58] 
as  applied  to  the  vector  space  of  matrices  with  the  norm  on 
this  space  defined  by  /^r  (^T A ’ ’ 


4 
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D.  COMPARISON  OF  KALMAN  AND  SUBOPTIMUM  FILTERS 

Let  be  the  signal  at  time  n  generated  according 


to 


e..,  =  yen  +  u 

n+1  n  n 


and  let  xR  be  the  received  signal  at  time  n  given  by 


x_  =  0_  +  . 

n  n  n 


By  (I. 8a),  (1. 8b) ,  and  (1. 8c) ,  the  Kalman  estimates  are  given  by 


@n+l  ■  T 


n+1 


.  p  r 
a/2  n 

+  q 


and  by  (1.15)  and  (1.16),  the  suboptimum  estimates 


3  -  =  7(9  -  M-(0  -  x  )) 

n+1  n  n  n 


bn+l  -  ^f1-  li>2  bn  +  +  «  • 


The  resulting  steady- state  solutions  are 


P  =  T 


P  r 

00 

P  +~r 

00 


+  q 


b  = 


42Y2r  +  q 
1-  Y2  (1-  H)2 


72  (1-  ^)2  <  1 


Using  the  M-  which  minimizes  the  expression  for  b 


b  (M-  .  )  =  P  . 

oo  '  opt  a> 


* 
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E.  CHOOSING  THE  4  FOR  SUBOPTIMUM  FILTER 

The  convergence  rate  of  the  filter  (1.15)  can  be  improved 

by  using  a  sequence  fun }  rather  than  a  constant  gain  4  . 

T 

In  fact,  as  pointed  out  earlier,  if  K(n)  *4  H  then  the 

n 

Kalman  and  IMS  filters  are  equivalent.  Since  this  in  general 
will  not  be  the  case,  how  does  one  pick  a  good  sequence  (4n)  ? 

An  answer  to  this  question  is  found  by  referring  back  to 
(1.19)  and  minimizing  the  R-H-S  with  respect  to  4  .  (This 
procedure  is  an  extension  of  that  given  by  Chein  and  Fu  [23]  . ) 
This  yields 


4 


tr(FTFHTHBn) 


n  tr ( FTFHT (HBnHT  +  R ) H ) 


(1.22) 


with  the  resulting  recursive  relation 

( tr  f  FTFHTHB  ) ) 2 


tr(B_.)  =  tr(FAFBj  - 


n' 


tr  {F  FH  (HBnHA  +  R)H) 


+  tr(GQG  }  . 


For  the  scalar  problem  considered  in  the  previous  example. 


the  4  found  by  using  (1.22)  are  given  by 


n _ 

+  r 


and  the  optimum  K(n)  are 

P 

K(n)  =  p-^  * 
n 

Thus,  if  bL  =  p1  ,  then  4n=K(n)  for  this  scalar  problem. 

It  is  also  interesting  to  note  that  the  choice  of  4n  doesn't 
depend  on  knowledge  of  G  or  Q  . 


1 


s 


9 


4 


131 


F.  FURTHER  COMPARISONS  OF  THE  TWO  FILTERS 

Some  further  comparisons  can  be  obtained  if  we  assume 
Q  *  0  .  For  then  the  Kalman  result  is 

P  =  F[P  -  P  HT(HP  HT  +  R)-1HP  ]FT 

00  L  OC  00  '  00  00  J 

and  consequently 

P  =  0  . 

00 

The  adaptive  feedback  filter  yields 

lim  lim  sup  E[||e-  0  || 2 ]  =  0 

(J.-»0  n-»oo 

provided  •  Thus,  in  the  limit  both  filters  can 

be  made  to  perform  arbitrary  close  to  each  other.  The 
previous  example  with  q  =  0  and  Y  =  1  provides  a  nice 
comparison. 

Example . 

Let  q  =  0  and  7  =  1  in  the  previous  example.  Assume 
pl=bl  *  The  corresponding  recursive  equations  are 

P  r 

pn+l  '  -p-n 

bn+i  =  (l-ti)2bn  +  b2r  . 

As  shown  in  Appendix  J, 

» _ !il_ 

n+1  nP1  +  r 


. . . ••  . •  -  .  - - ■-> - - 


comparison  between  the  adaptive  and  Kalman  filters. 
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and 


n+1 


in  Fig.  1.5  vehave  compared  Pn  and  bn  for  two  values  of 
[i  when  Pi  =b1  =  r  =  1  . 

Future  research  into  the  comparison  of  the  two  filters 
would  be  desirable.  Although  the  adaptive  algorithm  is 
suboptimal,  it  has  the  advantage  of  being  computationally 
simpler.  Also,  less  a  priori  statistics  are  needed  to  apply 
this  filter,  the  Gaussian  assumption  is  not  necessary,  and 
the  theory  presented  can  easily  be  extended  to  a  non-linear 

dynamic  system,  to  name  only  a  few  advantages  of  the  adaptive 
algorithm. 
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APPENDIX  J 

TWO  RECURSIVE  RELATIONS 


The  purpose  of  this  appendix  is  to  derive  two  recursive 
relations  needed  in  Appendix  I.  They  are  summarized  by  the 
following  lemma. 


Lemma.  Let  (an3  be  a  sequence  of  non- negative  real  numbers. 
Let  a  and  P  be  two  positive  real  constants  with 
a  <  1  .  Then, 


(i)  if 


0a 


n 


n 


a  +  p 
n  K 


n  >  1 


then 


n+1 


•jP 


na1  +  p 


and 


(ii)  if  an  =  aan  +  p  n  .>  1 

x 

then  an+1  = 

Proof  of  the  Lemma. 


axP 


n  (rv-  l)a^  +  P 


(i)  Assume  that 


1 


Then 


an  +  P 


(n-l)a1  +  p 


(n-l)a1  +  p 


3Za1  (n-lja^p 

(n-l)a1  +  P  &ip  +  3  (n-l)a1  +  p2 


nax  +  p  * 


This  is  the  assumed  form  of  the  relation.  It  is  easily 
verified  that  a^  and  a2  satisfy  the  formula.  Hence, 
by  induction,  the  desired  result  follows  for  all  n  . 


(ii)  Assume 


.  _  n-l.  ,  0  1-c"-1 
an  *  °  *1  +  g  l-o 


Then 


an+l  *  aan  +  P 


=  an*i +  p  JiTT  • 


This  is  the  assumed  form  of  the  relation.  It  is  easily 
verified  that  a1  and  a2  satisfy  the  formula.  Hence,  by 
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